EvolveNav: Empowering LLM-Based Vision-Language Navigation via Self-Improving Embodied Reasoning

Lin, Bingqian, Nie, Yunshuang, Zai, Khun Loun, Wei, Ziming, Han, Mingfei, Xu, Rongtao, Niu, Minzhe, Han, Jianhua, Zhang, Hanwang, Lin, Liang, Chen, Bokui, Lu, Cewu, Liang, Xiaodan

arXiv.org Artificial Intelligence 

Abstract--Recent studies have shown that training open-source Large Language Models (LLMs) can unleash their reasoning ability to enhance vision-language navigation (VLN) performance while mitigating the domain gap between the LLMs' training corpus and the VLN task. However, these approaches predominantly adopt straightforward input-output mapping paradigms, which makes the mapping difficult to learn and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, but the complexity of the navigation task makes perfect CoT labels unavailable, and pure CoT supervised fine-tuning may lead to overfitting. To address these issues, we propose EvolveNav, a novel sElf-improving embodied reasoning paradigm that realizes adaptable and generalizable navigational reasoning for boosting LLM-based vision-language Navigation. Specifically, EvolveNav involves a two-stage training process: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with curated formalized CoT labels to first activate the model's navigational reasoning

These two authors contribute equally to this work. Bokui Chen, Cewu Lu, and Xiaodan Liang are the corresponding authors. Bingqian Lin and Cewu Lu are with Shanghai Jiao Tong University, Shanghai, China. Yunshuang Nie, Khun Loun Zai, and Ziming Wei are with Shenzhen Campus of Sun Yat-sen University, Shenzhen, China. Xiaodan Liang is with Shenzhen Campus of Sun Yat-sen University, Shenzhen, China; Peng Cheng Laboratory; and Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, 510006, China. Bokui Chen is with Tsinghua Shenzhen International Graduate School, Tsinghua University, China. Mingfei Han is with the Department of Computer Vision, Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE.