Pay Self-Attention to Audio-Visual Navigation
Yu, Yinfeng, Cao, Lele, Sun, Fuchun, Liu, Xiaohong, Wang, Liejun
arXiv.org Artificial Intelligence
Audio-visual embodied navigation, an active research topic, aims to train a robot to reach an audio target using egocentric visual input (from sensors mounted on the robot) and audio input (emitted from the target). The audio-visual fusion strategy is naturally important to navigation performance, yet state-of-the-art methods still simply concatenate the visual and audio features, potentially ignoring the direct impact of context. Moreover, existing approaches require either phase-wise training or additional aids (e.g., topology graphs and sound semantics). To date, work on the more challenging setup with moving target(s) remains rare. To address these gaps, we propose FSAAVN (feature self-attention audio-visual navigation), an end-to-end framework that learns to chase a moving audio target using a context-aware audio-visual fusion strategy implemented as a self-attention module. Our thorough experiments validate the superior performance (both quantitatively and qualitatively) of FSAAVN in comparison with state-of-the-art methods, and also provide unique insights into the choice of visual modalities, visual/audio encoder backbones, and fusion patterns.
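To make the contrast between plain concatenation and self-attention fusion concrete, here is a minimal sketch in plain Python. It is not the FSAAVN architecture: the function names (`concat_fusion`, `self_attention_fusion`), the two-token setup, the feature dimension, and the random projection matrices are all assumptions made for illustration. In a trained model the projections would be learned and the features would come from the visual and audio encoders.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for a learned projection matrix."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def concat_fusion(visual, audio):
    """Baseline: concatenate the two feature vectors unchanged."""
    return list(visual) + list(audio)


def self_attention_fusion(visual, audio, Wq, Wk, Wv):
    """Treat the two modality features as two tokens and apply
    single-head scaled dot-product self-attention, so each output
    token is a context-dependent mix of both modalities.
    Assumes both features have the same dimension."""
    tokens = [visual, audio]
    queries = [matvec(Wq, t) for t in tokens]
    keys = [matvec(Wk, t) for t in tokens]
    values = [matvec(Wv, t) for t in tokens]
    d_k = len(keys[0])

    fused, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        attn = softmax(scores)  # how much this token attends to each modality
        weights.append(attn)
        mixed = [sum(a * v[j] for a, v in zip(attn, values)) for j in range(len(values[0]))]
        fused.extend(mixed)
    return fused, weights


rng = random.Random(0)
dim = 4  # illustrative feature size, not taken from the paper
visual = [rng.uniform(-1, 1) for _ in range(dim)]
audio = [rng.uniform(-1, 1) for _ in range(dim)]
Wq, Wk, Wv = (random_matrix(dim, dim, rng) for _ in range(3))

baseline = concat_fusion(visual, audio)
fused, weights = self_attention_fusion(visual, audio, Wq, Wk, Wv)
```

Both fused vectors have the same length here, but the concatenation baseline uses fixed positions for each modality, whereas the attention weights in `weights` change with the input, which is one way to read "context-aware" fusion.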
Oct-5-2022
- Country:
- North America
- United States
- Washington > King County
- Seattle (0.04)
- Utah > Salt Lake County
- Salt Lake City (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Hawaii > Honolulu County
- Honolulu (0.04)
- California > Los Angeles County
- Long Beach (0.04)
- Canada > Quebec
- Montreal (0.04)
- Europe
- Asia
- Macao (0.04)
- South Korea > Seoul
- Seoul (0.04)
- China
- Shandong Province > Qingdao (0.04)
- Beijing > Beijing (0.04)
- Africa > Ethiopia
- Addis Ababa > Addis Ababa (0.04)
- Genre:
- Research Report (0.70)
- Industry:
- Media (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Robots (1.00)
- Machine Learning > Neural Networks (0.46)
- Representation & Reasoning > Information Fusion (0.34)