Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation

Xie, Yifan, Ou, Binkai, Ma, Fei, Liu, Yaohua

arXiv.org Artificial Intelligence 

Abstract -- Vision and Language Navigation (VLN) requires an agent to navigate through environments by following natural language instructions. In this paper, we propose OIKG (Observation-graph Interaction and Key-detail Guidance), a novel framework that addresses the limitations of existing methods through two key components: (1) an observation-graph interaction module that decouples angular and visual information while strengthening edge representations in the navigation space, and (2) a key-detail guidance module that dynamically extracts and utilizes fine-grained location and object information from the instructions. By enabling more precise cross-modal alignment and dynamic instruction interpretation, our approach significantly improves the agent's ability to follow complex navigation instructions. Extensive experiments on the R2R and RxR datasets demonstrate that OIKG achieves state-of-the-art performance across multiple evaluation metrics, validating the effectiveness of our method in enhancing navigation precision through better observation-instruction alignment.

I. INTRODUCTION

Vision and Language Navigation (VLN) [1] is a challenging task that requires an AI agent to navigate through complex 3D environments [2], [3] by following natural language instructions. In this task, agents must process visual information from their surroundings while interpreting detailed navigation instructions to reach specified target locations. This involves understanding spatial relationships, recognizing objects and landmarks, and making sequential navigation decisions based on the cross-modal alignment between visual observations and linguistic guidance.
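To make the two components described in the abstract concrete, the sketch below illustrates one plausible way such a pipeline could be wired up: angular and visual node features are encoded by separate streams and fused together with edge features over the navigation graph, and the resulting node states attend only to instruction tokens tagged as location or object "key details". This is not the authors' implementation; all module names, feature dimensions, and design choices (e.g., a single linear fusion layer and a multi-head attention for guidance) are assumptions made for illustration.

```python
# Minimal sketch of observation-graph interaction and key-detail guidance.
# All names and dimensions are illustrative assumptions, not the OIKG code.
import torch
import torch.nn as nn


class ObservationGraphInteraction(nn.Module):
    """Fuses decoupled angular and visual node features with edge features."""

    def __init__(self, visual_dim=512, angle_dim=4, hidden_dim=256):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # visual stream
        self.angle_proj = nn.Linear(angle_dim, hidden_dim)    # angular stream
        self.edge_proj = nn.Linear(1, hidden_dim)              # edge (e.g., distance) features
        self.fuse = nn.Linear(3 * hidden_dim, hidden_dim)

    def forward(self, visual_feats, angle_feats, edge_feats):
        # visual_feats: (num_nodes, visual_dim); angle_feats: (num_nodes, angle_dim)
        # edge_feats: (num_nodes, 1), e.g., distance from the current node
        v = self.visual_proj(visual_feats)
        a = self.angle_proj(angle_feats)
        e = self.edge_proj(edge_feats)
        return torch.relu(self.fuse(torch.cat([v, a, e], dim=-1)))


class KeyDetailGuidance(nn.Module):
    """Attends from graph node states to instruction tokens marked as key details."""

    def __init__(self, hidden_dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, node_states, instr_tokens, non_key_mask):
        # node_states: (1, num_nodes, hidden); instr_tokens: (1, num_tokens, hidden)
        # non_key_mask: (1, num_tokens), True where a token is NOT a key detail (ignored)
        guided, _ = self.attn(node_states, instr_tokens, instr_tokens,
                              key_padding_mask=non_key_mask)
        return guided


if __name__ == "__main__":
    num_nodes, num_tokens = 6, 20
    ogi = ObservationGraphInteraction()
    kdg = KeyDetailGuidance()
    nodes = ogi(torch.randn(num_nodes, 512),   # visual features per candidate node
                torch.randn(num_nodes, 4),     # angular features (e.g., heading/elevation)
                torch.rand(num_nodes, 1))      # edge features (e.g., distance)
    instr = torch.randn(1, num_tokens, 256)    # encoded instruction tokens
    # Suppose tokens 3, 7, and 12 were tagged as location/object key details.
    non_key_mask = torch.ones(1, num_tokens, dtype=torch.bool)
    non_key_mask[0, [3, 7, 12]] = False
    guided = kdg(nodes.unsqueeze(0), instr, non_key_mask)
    scores = guided.squeeze(0).sum(-1)          # toy per-node navigation scores
    print("next node:", scores.argmax().item())
```

In this toy setup, the per-node scores stand in for whatever action-prediction head the full model would use; the point is only to show the decoupled angular/visual/edge encoding and the restriction of cross-modal attention to key-detail tokens.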