source localization
- Asia > China (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Health & Medicine (1.00)
- Energy (0.94)
- Information Technology > Security & Privacy (0.67)
- Information Technology > Data Science (0.93)
- Information Technology > Communications > Networks (0.93)
- Information Technology > Artificial Intelligence > Robots (0.93)
- (3 more...)
- North America > United States (0.46)
- Asia > Middle East > Iran (0.04)
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
- Workflow (1.00)
- Research Report > Experimental Study (1.00)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Asia > Middle East > Jordan (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia (0.04)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Therapeutic Area > Immunology (1.00)
- Health & Medicine > Epidemiology (1.00)
- Asia > China > Shanghai > Shanghai (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > Middle East > Jordan (0.04)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Therapeutic Area > Immunology (1.00)
- Health & Medicine > Epidemiology (1.00)
- Information Technology (0.68)
Learning Spatially-Aware Language and Audio Embeddings
Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like the lion roar came from right behind me!. For a machine to have the same degree of comprehension, the machine must know what a lion is (semantic attribute), what the concept of behind is (spatial attribute) and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when its coming from behind). State-of-the-art audio foundation models, such as CLAP, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs, and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to absolute position (e.g., 0.2m) rather than a position described using natural language (e.g., next to me).
Systematic Evaluation of Time-Frequency Features for Binaural Sound Source Localization
Panah, Davoud Shariat, Ragano, Alessandro, Barry, Dan, Skoglund, Jan, Hines, Andrew
ABSTRACT This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolu-tional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram, interaural phase difference - IPD). Evaluations on in-domain and out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often outperform increases in model complexity. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs combining channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)