soundscape
Dynamic Multi-Species Bird Soundscape Generation with Acoustic Patterning and 3D Spatialization
Zhang, Ellie L., Liao, Duoduo, Liao, Callie C.
Generation of dynamic, scalable multi-species bird soundscapes remains a significant challenge in computer music and algorithmic sound design. Birdsongs involve rapid frequency-modulated chirps, complex amplitude envelopes, distinctive acoustic patterns, overlapping calls, and dynamic inter-bird interactions, all of which require precise temporal and spatial control in 3D environments. Existing approaches, whether Digital Signal Processing (DSP)-based or data-driven, typically focus only on single-species modeling, static call structures, or synthesis directly from recordings, and often suffer from noise, limited flexibility, or large data requirements. To address these challenges, we present a novel, fully algorithm-driven framework that generates dynamic multi-species bird soundscapes using DSP-based chirp generation and 3D spatialization, without relying on recordings or training data. Our approach simulates multiple independently moving birds per species along distinct 3D trajectories, supporting controllable chirp sequences, overlapping choruses, and realistic 3D motion in scalable soundscapes while preserving species-specific acoustic patterns. A visualization interface provides bird trajectories, spectrograms, activity timelines, and sound waveforms for analytical and creative purposes. Both visual and audio evaluations demonstrate the system's ability to generate dense, immersive, and ecologically inspired soundscapes, highlighting its potential for computer music, interactive virtual environments, and computational bioacoustics research.
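As a rough illustration of the DSP-based chirp generation and spatialization the abstract describes, here is a minimal sketch of a frequency-modulated chirp with an amplitude envelope and simple constant-power stereo panning. The parameter values and the panning scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of DSP-based chirp synthesis with a simple amplitude
# envelope and stereo panning. Parameter values are illustrative only;
# they are not taken from the paper.
import numpy as np

def chirp(f0, f1, dur, sr=44100):
    """Frequency-modulated chirp sweeping from f0 to f1 Hz over dur seconds."""
    t = np.linspace(0.0, dur, int(sr * dur), endpoint=False)
    inst_freq = f0 + (f1 - f0) * t / dur            # linear frequency sweep
    phase = 2 * np.pi * np.cumsum(inst_freq) / sr   # integrate frequency for phase
    env = np.sin(np.pi * t / dur) ** 2              # fast attack/decay envelope
    return env * np.sin(phase)

def pan(mono, azimuth):
    """Constant-power stereo panning; azimuth in [-1 (left), 1 (right)]."""
    theta = (azimuth + 1) * np.pi / 4
    return np.stack([np.cos(theta) * mono, np.sin(theta) * mono], axis=1)

sr = 44100
call = chirp(f0=3000, f1=6000, dur=0.12, sr=sr)  # one upward chirp
stereo = pan(call, azimuth=0.5)                   # bird slightly to the right
```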
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > United Kingdom > England > Tyne and Wear > Sunderland (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Media > Music (0.88)
- Leisure & Entertainment (0.88)
Audio Geolocation: A Natural Sounds Benchmark
Chasmai, Mustafa, Liu, Wuao, Maji, Subhransu, Van Horn, Grant
Can we determine someone's geographic location purely from the sounds they hear? Are acoustic signals enough to localize within a country, state, or even city? We tackle the challenge of global-scale audio geolocation, formalize the problem, and conduct an in-depth analysis with wildlife audio from the iNatSounds dataset. Adopting a vision-inspired approach, we convert audio recordings to spectrograms and benchmark existing image geolocation techniques. We hypothesize that species vocalizations offer strong geolocation cues due to their defined geographic ranges and propose an approach that integrates species range prediction with retrieval-based geolocation. We further evaluate whether geolocation improves when analyzing species-rich recordings or when aggregating across spatiotemporal neighborhoods. Finally, we introduce case studies from movies to explore multimodal geolocation using both audio and visual content. Our work highlights the advantages of integrating audio and visual cues, and sets the stage for future research in audio geolocation.
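A small sketch of the spectrogram conversion step mentioned above, turning a recording into a log-mel "image" that an image geolocation model could consume; the librosa parameters are assumptions, not the authors' exact pipeline.

```python
# Convert a wildlife recording into a normalized log-mel spectrogram array
# that can be treated like a grayscale image. Parameter choices are
# illustrative, not the paper's configuration.
import numpy as np
import librosa

def audio_to_logmel(path, sr=22050, n_mels=128):
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel, ref=np.max)
    # Scale to [0, 1] so downstream image models can consume it directly.
    return (logmel - logmel.min()) / (logmel.max() - logmel.min() + 1e-8)

# image = audio_to_logmel("recording.wav")  # shape: (n_mels, time_frames)
```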
- Leisure & Entertainment (1.00)
- Media > Film (0.93)
- Information Technology > Artificial Intelligence > Speech (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Communications > Social Media (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Domain Adaptation Method and Modality Gap Impact in Audio-Text Models for Prototypical Sound Classification
Acevedo, Emiliano, Rocamora, Martín, Fuentes, Magdalena
Audio-text models are widely used in zero-shot environmental sound classification as they alleviate the need for annotated data. However, we show that their performance severely drops in the presence of background sound sources. Our analysis reveals that this degradation is primarily driven by SNR levels of background soundscapes, and independent of background type. To address this, we propose a novel method that quantifies and integrates the contribution of background sources into the classification process, improving performance without requiring model retraining. Our domain adaptation technique enhances accuracy across various backgrounds and SNR conditions. Moreover, we analyze the modality gap between audio and text embeddings, showing that narrowing this gap improves classification performance.
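To make the zero-shot setup concrete, the sketch below scores an audio embedding against text-prompt embeddings by cosine similarity and shows one simple way to narrow a modality gap by removing the mean offset between the two embedding clusters. The encoder functions are placeholders for a CLAP-style audio-text model, not a specific library API, and the gap correction is a generic illustration rather than the paper's method.

```python
# Zero-shot sound classification with audio-text embeddings (sketch).
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_classify(audio_emb, text_embs, labels):
    """Pick the class whose text-prompt embedding is most similar to the audio."""
    scores = [cosine(audio_emb, t) for t in text_embs]
    return labels[int(np.argmax(scores))], scores

def remove_modality_gap(audio_emb, audio_mean, text_mean):
    """Subtract the mean offset between audio and text embedding clusters."""
    return audio_emb - (audio_mean - text_mean)
```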
- North America > United States (0.05)
- South America > Uruguay (0.04)
- Europe > Spain (0.04)
- Europe > Hungary > Budapest > Budapest (0.04)
SounDiT: Geo-Contextual Soundscape-to-Landscape Generation
Wang, Junbo, Tan, Haofeng, Liao, Bowen, Jiang, Albert, Fei, Teng, Huang, Qixing, Tu, Zhengzhong, Ye, Shan, Kang, Yuhao
We present a novel and practically significant problem, Geo-Contextual Soundscape-to-Landscape (GeoS2L) generation, which aims to synthesize geographically realistic landscape images from environmental soundscapes. Prior audio-to-image generation methods typically rely on general-purpose datasets and overlook geographic and environmental contexts, resulting in unrealistic images that are misaligned with real-world environmental settings. To address this limitation, we introduce a novel geo-contextual computational framework that explicitly integrates geographic knowledge into multimodal generative modeling. We construct two large-scale geo-contextual multimodal datasets, SoundingSVI and SonicUrban, pairing diverse soundscapes with real-world landscape images. We propose SounDiT, a novel Diffusion Transformer (DiT)-based model that incorporates geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose a practically informed geo-contextual evaluation framework, the Place Similarity Score (PSS), which measures consistency between input soundscapes and generated landscape images at the element, scene, and human-perception levels. Extensive experiments demonstrate that SounDiT outperforms existing baselines in both visual fidelity and alignment with geographic settings. Our work not only establishes foundational benchmarks for GeoS2L generation but also highlights the importance of incorporating geographic domain knowledge in advancing multimodal generative models, opening new directions at the intersection of generative AI, geography, urban planning, and environmental sciences.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- (8 more...)
A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration
Selvamani, Shaja Arul, Ganapathy, Nia D'Souza
This research introduces an innovative AI-driven multi-agent framework specifically designed for creating immersive audiobooks. Leveraging neural text-to-speech synthesis with FastSpeech 2 and VALL-E for expressive narration and character-specific voices, the framework employs advanced language models to automatically interpret textual narratives and generate realistic spatial audio effects. These sound effects are dynamically synchronized with the storyline through sophisticated temporal integration methods, including Dynamic Time Warping (DTW) and recurrent neural networks (RNNs). Diffusion-based generative models combined with higher-order ambisonics (HOA) and scattering delay networks (SDN) enable highly realistic 3D soundscapes, substantially enhancing listener immersion and narrative realism. This technology significantly advances audiobook applications, providing richer experiences for educational content, storytelling platforms, and accessibility solutions for visually impaired audiences. Future work will address personalization, ethical management of synthesized voices, and integration with multi-sensory platforms.
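The temporal synchronization step relies on Dynamic Time Warping (DTW); the sketch below is a textbook DTW distance between two 1-D feature sequences, the kind of alignment used to line up generated effects with narration. It is an illustrative implementation, not the framework's code.

```python
# Minimal dynamic time warping (DTW) sketch for aligning two feature curves,
# e.g. a narration energy contour and a sound-effect cue contour.
import numpy as np

def dtw_distance(x, y):
    """Accumulated DTW cost between two 1-D sequences x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three admissible alignment moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance(np.array([0.1, 0.8, 0.3]), np.array([0.2, 0.7, 0.7, 0.2])))
```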
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > India > Maharashtra > Mumbai (0.04)
- Health & Medicine (1.00)
- Media > Publishing (0.65)
- Information Technology > Security & Privacy (0.49)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
ECOSoundSet: a finely annotated dataset for the automated acoustic identification of Orthoptera and Cicadidae in North, Central and temperate Western Europe
Funosas, David, Massol, Elodie, Bas, Yves, Schmidt, Svenja, Arend, Dominik, Gebhard, Alexander, Barbaro, Luc, König, Sebastian, Font, Rafael Carbonell, Sannier, David, Deroussen, Fernand, Sueur, Jérôme, Roesti, Christian, Trilar, Tomi, Forstmeier, Wolfgang, Roger, Lucas, Matheu, Eloïsa, Guzik, Piotr, Barataud, Julien, Pelozuelo, Laurent, Puissant, Stéphane, Mueller, Sandra, Schuller, Björn, Montoya, Jose M., Triantafyllopoulos, Andreas, Cauchoix, Maxime
Currently available tools for the automated acoustic recognition of European insects in natural soundscapes are limited in scope. Large and ecologically heterogeneous acoustic datasets are needed for these algorithms to recognize, across contexts, the subtle and complex acoustic signatures produced by each species, making the availability of such datasets a key requisite for their development. Here we present ECOSoundSet (European Cicadidae and Orthoptera Sound dataSet), a dataset containing 10,653 recordings of 200 orthopteran and 24 cicada species (217 and 26 taxa, respectively, when including subspecies) present in North, Central, and temperate Western Europe (Andorra, Belgium, Denmark, mainland France and Corsica, Germany, Ireland, Luxembourg, Monaco, Netherlands, United Kingdom, Switzerland), collected partly through targeted fieldwork in southern France and Catalonia and partly through contributions from various European entomologists. The dataset combines coarsely labeled recordings, for which we can only infer the presence, at some point, of their target species (weak labeling), and finely annotated recordings, for which we know the specific time and frequency range of each insect sound present in the recording (strong labeling). We also provide a train/validation/test split of the strongly labeled recordings, with approximate proportions of 0.8, 0.1, and 0.1 respectively, in order to facilitate their incorporation in the training and evaluation of deep learning algorithms. This dataset could serve as a meaningful complement to recordings already available online for training deep learning algorithms for the acoustic classification of orthopterans and cicadas in North, Central, and temperate Western Europe.
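For orientation, here is a minimal sketch of reproducing an approximate 0.8/0.1/0.1 split like the one shipped with the strongly labeled recordings; the file identifiers and seed are hypothetical, so the dataset's released split should be used for comparable results.

```python
# Approximate 80/10/10 train/validation/test split (sketch).
# Recording IDs and the random seed are placeholders, not the released split.
from sklearn.model_selection import train_test_split

recordings = [f"rec_{i:05d}.wav" for i in range(10653)]  # placeholder IDs

train, rest = train_test_split(recordings, test_size=0.2, random_state=0)
val, test = train_test_split(rest, test_size=0.5, random_state=0)
print(len(train), len(val), len(test))  # roughly 80% / 10% / 10%
```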
- Europe > Western Europe (0.81)
- Europe > Netherlands (0.24)
- Europe > Monaco (0.24)
- (26 more...)
Self-Supervised Audio-Visual Soundscape Stylization
Li, Tingle, Wang, Renhao, Huang, Po-Yao, Owens, Andrew, Anumanchipalli, Gopala
Speech sounds convey a great deal of information about the scenes in which they occur, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it were recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example's sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities. Please see our project webpage for video results: https://tinglok.netlify.app/files/avsoundscape/
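A rough sketch of the self-supervised pair construction described above: one clip from a video's audio track is the target, another clip from elsewhere in the same video serves as the conditional hint, and an enhanced version of the target is the model input. The enhancement function here is an identity placeholder standing in for a real speech-enhancement model, and the clip length is arbitrary; this is the data setup only, not the paper's latent diffusion training.

```python
# Build (enhanced input, conditional hint, target) triplets from one video's
# audio track for self-supervised training (sketch).
import numpy as np

def enhance_speech(x):
    # Placeholder: a real pipeline would denoise/dereverberate the clip.
    return x

def make_training_triplet(audio, sr, clip_sec=3.0, rng=None):
    rng = rng or np.random.default_rng()
    n = int(clip_sec * sr)
    i = int(rng.integers(0, len(audio) - n))
    j = int(rng.integers(0, len(audio) - n))
    target = audio[i:i + n]    # original speech the model must recover
    hint = audio[j:j + n]      # conditional example from elsewhere in the video
    return enhance_speech(target), hint, target
```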
- North America > United States > Michigan (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Leisure & Entertainment (0.67)
- Media > Music (0.46)
Long-Term, Store-Front Robotics: Interactive Music for Robotic Arm, Caxixi and Frame Drums
Savery, Richard, Sukkar, Fouad
This paper presents an innovative exploration into the integration of interactive robotic musicianship within a commercial retail environment, specifically through a three-week-long in-store installation featuring a UR3 robotic arm, custom-built frame drums, and an adaptive music generation system. Situated in a prominent storefront in one of the world's largest cities, this project aimed to enhance the shopping experience by creating dynamic, engaging musical interactions that respond to the store's ambient soundscape. Key contributions include the novel application of industrial robotics in artistic expression, the deployment of interactive music to enrich retail ambiance, and the demonstration of continuous robotic operation in a public setting over an extended period. Challenges such as system reliability, variation in musical output, safety in interactive contexts, and brand alignment were addressed to ensure the installation's success. The project not only showcased the technical feasibility and artistic potential of robotic musicianship in retail spaces but also offered insights into the practical implications of such integration, including system reliability, the dynamics of human-robot interaction, and the impact on store operations. This exploration opens new avenues for enhancing consumer retail experiences through the intersection of technology, music, and interactive art, suggesting a future where robotic musicianship contributes meaningfully to public and commercial spaces.
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Research Report (0.82)
- Overview > Innovation (0.34)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
Towards Deep Active Learning in Avian Bioacoustics
Rauch, Lukas, Huseljic, Denis, Wirth, Moritz, Decke, Jens, Sick, Bernhard, Scholz, Christoph
Passive acoustic monitoring (PAM) in avian bioacoustics enables cost-effective and extensive data collection with minimal disruption to natural habitats. Despite advancements in computational avian bioacoustics, deep learning models continue to encounter challenges in adapting to diverse environments in practical PAM scenarios. This is primarily due to the scarcity of annotations, which are labor-intensive for human experts to produce. Active learning (AL) reduces annotation cost and speeds up adaptation to diverse scenarios by querying the most informative instances for labeling. This paper outlines a deep AL approach, introduces key challenges, and conducts a small-scale pilot study.
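As a concrete illustration of querying "the most informative instances", here is a minimal uncertainty-sampling sketch that ranks unlabeled clips by the entropy of predicted species probabilities; the paper's AL strategies may differ, and the model call is hypothetical.

```python
# Uncertainty sampling (entropy-based) for active learning, as a sketch.
import numpy as np

def entropy(probs, eps=1e-12):
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def query_most_informative(probs_unlabeled, k=10):
    """probs_unlabeled: (n_clips, n_species) predicted probabilities."""
    scores = entropy(probs_unlabeled)
    return np.argsort(scores)[::-1][:k]  # indices of the k most uncertain clips

# probs = model.predict_proba(unlabeled_clips)   # hypothetical model call
# to_label = query_most_informative(probs, k=32)
```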
- North America > United States > Nevada (0.05)
- Europe > Switzerland (0.05)
- Europe > Germany (0.05)
- South America > Peru (0.04)
Sound Tagging in Infant-centric Home Soundscapes
Khan, Mohammad Nur Hossain, Li, Jialu, McElwain, Nancy L., Hasegawa-Johnson, Mark, Islam, Bashima
Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies have focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or young children in the environment, or collect data from only a single family, where noise from a fixed sound source can be moderate at the infant's position or vice versa. Thus, despite the recent success of large pre-trained models for noise event detection, the performance of these models on infant-centric noise soundscapes in the home is yet to be explored. To bridge this gap, we have collected and labeled noises in home soundscapes from 22 families in an unobtrusive manner, where the data are collected through an infant-worn recording device. In this paper, we explore the performance of a large pre-trained model (the Audio Spectrogram Transformer, AST) on our noise-conditioned infant-centric environmental data as well as on publicly available home environmental datasets. Using different training strategies, such as resampling, incorporating public datasets, mixing public and infant-centric training sets, and augmenting data with noise and masking, we evaluate the performance of a large pre-trained model on sparse and imbalanced infant-centric data. Our results show that fine-tuning the large pre-trained model by combining our collected dataset with public datasets increases the F1-score from 0.11 (public datasets) and 0.76 (collected datasets) to 0.84 (combined datasets), and Cohen's Kappa from 0.013 (public datasets) and 0.77 (collected datasets) to 0.83 (combined datasets), compared to training only with public or collected datasets, respectively.
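The abstract reports improvements in F1-score and Cohen's Kappa; the sketch below shows how these metrics are computed with scikit-learn on hypothetical tag predictions, not the paper's data or code.

```python
# Evaluation metrics used above: F1-score and Cohen's Kappa (sketch).
from sklearn.metrics import f1_score, cohen_kappa_score

y_true = [0, 1, 2, 1, 0, 2, 1]   # placeholder ground-truth sound tags
y_pred = [0, 1, 2, 0, 0, 2, 1]   # placeholder model predictions

print("F1 (macro):", f1_score(y_true, y_pred, average="macro"))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
```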
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- North America > United States > Colorado > Boulder County > Boulder (0.04)
- Asia > China > Anhui Province > Hefei (0.04)