Seeing Sound Hearing Sight Uncovering Modality Bias and Conflict of AI models in Sound Localization
–Neural Information Processing Systems
Imagine hearing a dog bark and instinctively turning toward the sound--only to find a parked car, while a silent dog sits nearby. Such moments of sensory conflict challenge perception, yet humans flexibly resolve these discrepancies, prioritizing auditory cues over misleading visuals to accurately localize sounds. Despite the rapid advancement of multimodal AI models that integrate vision and sound, little is known about how these systems handle cross-modal conflicts or whether they favor one modality over another. Here, we systematically and quantitatively examine modality bias and conflict resolution in AI models for Sound Source Localization (SSL). We evaluate a wide range of state-of-the-art multimodal models and compare them against human performance in psychophysics experiments spanning six audiovisual conditions, including congruent, conflicting, and absent visual and audio cues.
Neural Information Processing Systems
Jun-22-2026, 22:33:26 GMT
- Country:
- Asia (0.46)
- North America > United States (0.28)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.67)
- Research Report
- Industry:
- Health & Medicine (0.93)
- Technology: