SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

Oct-10-2024, 02:29:09 GMT–Neural Information Processing Systems

Natural language instructions for visual navigation often use scene descriptions (e.g., bedroom) and object references (e.g., green chairs) to provide a breadcrumb trail to a goal location. This work presents a transformer-based vision-and-language navigation (VLN) agent that uses two different visual encoders -- a scene classification network and an object detector -- which produce features that match these two distinct types of visual cues. In our method, scene features contribute high-level contextual information that supports object-level processing. With this design, our model is able to use vision-and-language pretraining (i.e., learning the alignment between images and text from large-scale web data) to substantially improve performance on the Room-to-Room (R2R) and Room-Across-Room (RxR) benchmarks. Specifically, our approach leads to improvements of 1.8% absolute in SPL on R2R and 3.7% absolute in SR on RxR.
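The reported gains are stated in terms of SR (success rate) and SPL (success weighted by path length). As a reminder of what these metrics measure, the following is a minimal sketch based on their standard definitions in the embodied-navigation literature. It is not the authors' evaluation code: the function names and the toy episode values are illustrative assumptions, and all path lengths are assumed to be positive.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent stopped at the goal."""
    return sum(1 for s in successes if s) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by (normalized inverse) path length.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length to the goal and p is the length of the path the
    agent actually took. Failed episodes contribute 0.
    Assumes every length is strictly positive.
    """
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if ok:
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Toy data (hypothetical): three episodes, two of them successful.
toy_successes = [True, False, True]
toy_shortest = [10.0, 10.0, 10.0]
toy_taken = [10.0, 20.0, 20.0]

toy_sr = success_rate(toy_successes)
toy_spl = spl(toy_successes, toy_shortest, toy_taken)
```

In this toy case the second successful episode takes a path twice as long as necessary, so it counts as a full success for SR but only as half a success for SPL. This is why an SPL improvement indicates more efficient trajectories and not only more frequent goal arrival.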

instruction, scene- and object-aware transformer, vision-and-language navigation

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.63)
  - Machine Learning (0.43)