SONAR: Semantic-Object Navigation with Aggregated Reasoning through a Cross-Modal Inference Paradigm

Wang, Yao, Sun, Zhirui, Chi, Wenzheng, Jia, Baozhi, Xu, Wenjun, Wang, Jiankun

arXiv.org Artificial Intelligence 

Abstract Understanding human instructions and accomplishing Vision-Language Navigation tasks in unknown environments is essential for robots. However, existing modular approaches rely heavily on the quality of training data and often generalize poorly. Vision-Language Model (VLM) based methods, while demonstrating strong generalization, tend to perform unsatisfactorily when semantic cues are weak. To address these issues, this paper proposes SONAR, an aggregated reasoning approach built on a cross-modal inference paradigm. The proposed method integrates a semantic-map-based target prediction module with a VLM-based value map module, enabling more robust navigation in unknown environments with varying levels of semantic cues and effectively balancing generalization ability with scene adaptability. For target localization, we propose a strategy that fuses multi-scale semantic maps with confidence maps to mitigate false detections of target objects. We conducted an evaluation of SONAR within the Gazebo simulator, leveraging the most challenging Mat- Experimental results demonstrate that SONAR achieves a success rate of 38.4% and an SPL of 17.7%.

Corresponding author: Jiankun Wang, E-mail: wangjk@sustech.edu.cn

Keywords Object Goal Navigation · Vision-Language Model · Aggregated Reasoning

1 Introduction

In an unknown environment, for a robot to accurately understand human instructions and complete vision-language navigation tasks, it must rely on limited visual and linguistic cues to develop efficient exploration strategies while achieving precise identification of target objects [1].
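The fusion of multi-scale semantic maps with confidence maps described in the abstract can be illustrated with a minimal sketch. The paper does not give the fusion rule, so the confidence-weighted averaging below, the `threshold` parameter, and the toy maps are all assumptions for illustration: a detection seen only at one low-confidence scale is suppressed, while a detection confirmed across scales survives.

```python
import numpy as np

def fuse_multiscale(semantic_maps, confidence_maps, threshold=0.6):
    """Confidence-weighted fusion of multi-scale semantic maps (illustrative).

    semantic_maps:   list of HxW arrays of per-cell target scores, one per scale
    confidence_maps: list of HxW arrays of detection confidence, one per scale
    A cell is kept as a target candidate only if its confidence-weighted
    average score across scales exceeds `threshold`, which suppresses
    detections that appear at a single low-confidence scale.
    """
    sem = np.stack(semantic_maps)      # (S, H, W)
    conf = np.stack(confidence_maps)   # (S, H, W)
    weighted = (sem * conf).sum(axis=0) / (conf.sum(axis=0) + 1e-8)
    return weighted > threshold

# Toy example: a false positive at one scale vs. a target confirmed at two scales.
s1 = np.zeros((4, 4)); s1[1, 1] = 1.0   # spurious detection, scale 1 only
s2 = np.zeros((4, 4)); s2[2, 2] = 1.0   # true target, scale 2
s3 = np.zeros((4, 4)); s3[2, 2] = 1.0   # same target confirmed at scale 3
c1 = np.full((4, 4), 0.3)               # low-confidence scale
c2 = np.full((4, 4), 0.9)
c3 = np.full((4, 4), 0.8)
mask = fuse_multiscale([s1, s2, s3], [c1, c2, c3])
# mask is True only at (2, 2): 0.3/2.0 = 0.15 at (1, 1), 1.7/2.0 = 0.85 at (2, 2)
```

The design choice here is that confidence acts as a per-scale vote weight rather than a hard gate, so a weak scale can still contribute evidence without being able to trigger a detection on its own.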
