Audio-Guided Dynamic Modality Fusion with Stereo-Aware Attention for Audio-Visual Navigation

Li, Jia; Yu, Yinfeng; Wang, Liejun; Sun, Fuchun; Zheng, Wendong

Sep-23-2025–arXiv.org Artificial Intelligence

In audio-visual navigation (AVN) tasks, an embodied agent must autonomously localize a sound source in unknown and complex 3D environments based on audio-visual signals. Existing methods often rely on static modality fusion strategies and neglect the spatial cues embedded in stereo audio, leading to performance degradation in cluttered or occluded scenes. To address these issues, we propose an end-to-end reinforcement learning-based AVN framework with two key innovations: (1) a Stereo-Aware Attention Module (SAM), which learns and exploits the spatial disparity between left and right audio channels to enhance directional sound perception; and (2) an Audio-Guided Dynamic Fusion Module (AGDF), which dynamically adjusts the fusion ratio between visual and auditory features based on audio cues, thereby improving robustness to environmental changes. Extensive experiments are conducted on two realistic 3D scene datasets, Replica and Matterport3D, demonstrating that our method significantly outperforms existing approaches in terms of navigation success rate and path efficiency. Notably, our model achieves over 40% improvement under audio-only conditions compared to the best-performing baselines.
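The two ideas named in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the abstract does not specify layer shapes, the attention form, or how the fusion ratio is computed. The functions `stereo_attention` and `audio_guided_fusion`, the softmax-over-disparity weighting, and the scalar sigmoid gate with parameters `gate_w` and `gate_b` are all assumptions made here for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def stereo_attention(left, right):
    """Hypothetical stereo-aware weighting (an assumption, not the paper's SAM).

    Feature dimensions where the left and right channels differ most
    receive the largest attention weight.
    """
    disparity = [abs(l - r) for l, r in zip(left, right)]
    weights = softmax(disparity)
    # Weight the per-dimension mean of the two channels.
    return [w * (l + r) / 2.0 for w, l, r in zip(weights, left, right)]


def audio_guided_fusion(audio, visual, gate_w, gate_b):
    """Hypothetical audio-guided gate (an assumption, not the paper's AGDF).

    A scalar ratio g in (0, 1) is computed from the audio feature alone,
    then used as: fused = g * audio + (1 - g) * visual.
    """
    g = sigmoid(sum(w * a for w, a in zip(gate_w, audio)) + gate_b)
    fused = [g * a + (1.0 - g) * v for a, v in zip(audio, visual)]
    return fused, g


if __name__ == "__main__":
    left, right = [0.2, 0.9, 0.4], [0.2, 0.1, 0.5]
    audio = stereo_attention(left, right)
    visual = [0.3, 0.3, 0.3]
    fused, g = audio_guided_fusion(audio, visual, [1.0, 1.0, 1.0], 0.0)
    print("fusion ratio:", g)
    print("fused feature:", fused)
```

In this sketch the gate depends only on the audio feature, so the mix between modalities changes as the sound changes; in a learned model, `gate_w` and `gate_b` would be trained jointly with the navigation policy rather than set by hand.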

machine learning, navigation, reinforcement learning

Country:
- Asia > China (0.29)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Reinforcement Learning (0.67)
  - Representation & Reasoning > Agents (0.66)
