Pay Self-Attention to Audio-Visual Navigation

Yu, Yinfeng, Cao, Lele, Sun, Fuchun, Liu, Xiaohong, Wang, Liejun

Oct-5-2022–arXiv.org Artificial Intelligence

Audio-visual embodied navigation, a hot research topic, aims to train a robot to reach an audio target using egocentric visual input (from sensors mounted on the robot) and audio input (emitted from the target). The audio-visual fusion strategy is naturally important to navigation performance, but state-of-the-art methods still simply concatenate the visual and audio features, potentially ignoring the direct impact of context. Moreover, existing approaches require either phase-wise training or additional aids (e.g., topology graphs and sound semantics). To date, work that deals with the more challenging setup of moving target(s) is still rare. We therefore propose an end-to-end framework, FSAAVN (feature self-attention audio-visual navigation), that learns to chase a moving audio target using a context-aware audio-visual fusion strategy implemented as a self-attention module. Our thorough experiments validate the superior performance (both quantitatively and qualitatively) of FSAAVN in comparison with state-of-the-art methods, and also provide unique insights into the choice of visual modalities, visual/audio encoder backbones, and fusion patterns.
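The contrast the abstract draws between plain concatenation and context-aware self-attention fusion can be illustrated with a minimal sketch. This is not the FSAAVN module itself: the function names, the toy feature vectors, and the use of identity query/key/value projections (no learned weights) are all assumptions made only to keep the example short and runnable. It treats the visual and audio features as a two-token sequence and applies scaled dot-product self-attention, so each fused token depends on both modalities.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def concat_fusion(visual, audio):
    # Baseline: the two feature vectors are simply joined end to end.
    return list(visual) + list(audio)

def self_attention_fusion(visual, audio):
    # Hypothetical sketch: visual and audio features form a 2-token sequence.
    # Identity projections are assumed (a real module would learn Q/K/V weights).
    assert len(visual) == len(audio), "both modalities must share one dimension"
    tokens = [list(visual), list(audio)]
    d = len(visual)
    fused = []
    for q in tokens:
        # Scaled dot-product scores of this token against every token.
        scores = [dot(q, k) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Each output token is a weighted mix of both modalities.
        mixed = [sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)]
        fused.extend(mixed)
    return fused

# Toy features (made-up numbers, for illustration only).
visual_feat = [1.0, 0.0, 0.5, 0.2]
audio_feat = [0.1, 0.9, 0.4, 0.0]

baseline = concat_fusion(visual_feat, audio_feat)
fused = self_attention_fusion(visual_feat, audio_feat)
```

Both outputs have the same length, so either could feed the same downstream policy network; the difference is that every entry of `fused` mixes information from both modalities, whereas `baseline` keeps them in separate slots.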

artificial intelligence, machine learning, pay self-attention

Country:
- North America
  - United States
    - Washington > King County
      - Seattle (0.04)
    - Utah > Salt Lake County
      - Salt Lake City (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - Hawaii > Honolulu County
      - Honolulu (0.04)
    - California > Los Angeles County
      - Long Beach (0.04)
  - Canada > Quebec
    - Montreal (0.04)
- Europe
  - Austria (0.04)
  - Sweden > Stockholm
    - Stockholm (0.04)
  - Germany > Baden-Württemberg
    - Freiburg (0.04)
  - France > Île-de-France
    - Paris > Paris (0.04)
- Asia
  - Macao (0.04)
  - South Korea > Seoul
    - Seoul (0.04)
  - China
    - Shandong Province > Qingdao (0.04)
    - Beijing > Beijing (0.04)
- Africa > Ethiopia
  - Addis Ababa > Addis Ababa (0.04)

Genre:
- Research Report (0.70)

Industry:
- Media (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Robots (1.00)
  - Machine Learning > Neural Networks (0.46)
  - Representation & Reasoning > Information Fusion (0.34)
