Zhou, Gengze
Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments
Li, Zerui, Zhou, Gengze, Hong, Haodong, Shao, Yanyan, Lyu, Wenqi, Qiao, Yanyuan, Wu, Qi
Vision-and-Language Navigation (VLN) empowers agents to associate time-sequenced visual observations with corresponding instructions to make sequential decisions. However, generalization remains a persistent challenge, particularly when dealing with visually diverse scenes or transitioning from simulated environments to real-world deployment. In this paper, we address the mismatch between human-centric instructions and quadruped robots with a low-height field of view by proposing a Ground-level Viewpoint Navigation (GVNav) approach. This work represents the first attempt to highlight the generalization gap in VLN across varying heights of visual observation in realistic robot deployments, enabling low-height robots to overcome challenges such as visual obstructions and perceptual mismatches. Additionally, we transfer the connectivity graphs from the HM3D and Gibson datasets as an extra resource to enhance spatial priors and provide a more comprehensive representation of real-world scenarios, leading to improved performance and generalizability of the waypoint predictor in real-world environments. Extensive experiments demonstrate that GVNav significantly improves performance in both simulated environments and real-world deployments with quadruped robots.
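To make the waypoint-supervision idea concrete, below is a minimal sketch (not the paper's code) of how a connectivity graph with node positions and navigability edges could be turned into relative heading/distance targets for a candidate waypoint predictor; the toy graph, bin count, and function names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's code): converting a scene
# connectivity graph into relative waypoint targets, the kind of supervision
# a candidate waypoint predictor is typically trained on.
import math
from collections import defaultdict

# Toy connectivity graph: node id -> (x, y, z) position in meters,
# plus an undirected edge list between navigable viewpoints.
NODE_POS = {
    "n0": (0.0, 0.0, 0.0),
    "n1": (2.0, 0.5, 0.0),
    "n2": (-1.5, 2.0, 0.0),
    "n3": (0.5, -2.5, 0.0),
}
EDGES = [("n0", "n1"), ("n0", "n2"), ("n0", "n3"), ("n1", "n2")]


def neighbors(node):
    """Collect the navigable neighbors of a node from the edge list."""
    adj = defaultdict(set)
    for a, b in EDGES:
        adj[a].add(b)
        adj[b].add(a)
    return sorted(adj[node])


def waypoint_targets(node, num_angle_bins=12, max_range=3.0):
    """Return (neighbor, heading_bin, distance) targets for `node`.

    Heading is discretized into `num_angle_bins` sectors around the agent,
    mirroring the polar layout used by typical candidate-waypoint predictors;
    neighbors farther than `max_range` meters are skipped.
    """
    x0, y0, _ = NODE_POS[node]
    targets = []
    for nb in neighbors(node):
        x1, y1, _ = NODE_POS[nb]
        dx, dy = x1 - x0, y1 - y0
        dist = math.hypot(dx, dy)
        if dist > max_range:
            continue
        heading = math.atan2(dy, dx) % (2 * math.pi)
        angle_bin = int(heading / (2 * math.pi) * num_angle_bins)
        targets.append((nb, angle_bin, round(dist, 2)))
    return targets


if __name__ == "__main__":
    print(waypoint_targets("n0"))
```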
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
Zhou, Gengze, Hong, Yicong, Wang, Zun, Zhao, Chongyang, Bansal, Mohit, Wu, Qi
The academic field of learning instruction-guided visual navigation can be generally categorized into high-level category-specific search and low-level language-guided navigation, depending on the granularity of language instruction, in which the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework.

Subsequent works leverage generic vision-language representations [18, 59, 61, 96, 97] to pretrain vision-language-action policies [14, 16, 32, 34, 36, 60, 74, 81] (Figure 1b), finetuning parameters for specific tasks while maintaining the same model architecture. In this paper, we argue that the essential difference between these tasks lies in the granularity of instruction, and the learning problems should be unified under the broader concept of language-guided visual navigation (VLN), where the overarching goal is to create a versatile system that can interpret and execute arbitrary language instructions (Figure 1c).
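As an illustration of the state-adaptive mixture-of-experts idea named in the title, the following is a minimal PyTorch sketch in which a router conditioned on a single state vector (e.g., pooled instruction and observation features) blends several expert feed-forward networks; the dimensions, soft routing, and module names are assumptions rather than SAME's exact design.

```python
# Minimal sketch: a state-adaptive mixture-of-experts feed-forward layer.
# The router is conditioned on one state vector rather than on each token,
# so every token in the current step shares the same expert mixture.
import torch
import torch.nn as nn


class StateAdaptiveMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # Router maps the state vector to per-expert mixing weights.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, tokens, state):
        # tokens: (batch, seq, d_model); state: (batch, d_model)
        weights = torch.softmax(self.router(state), dim=-1)                 # (B, E)
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=1)  # (B, E, S, D)
        # Blend expert outputs with the state-conditioned weights.
        return (weights[:, :, None, None] * expert_out).sum(dim=1)          # (B, S, D)


if __name__ == "__main__":
    layer = StateAdaptiveMoE()
    tok = torch.randn(2, 10, 512)
    st = tok.mean(dim=1)  # a crude pooled "state" just for the demo
    print(layer(tok, st).shape)  # torch.Size([2, 10, 512])
```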
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
Zhou, Gengze, Hong, Yicong, Wang, Zun, Wang, Xin Eric, Wu, Qi
Capitalizing on the remarkable advancements in Large Language Models (LLMs), there is a burgeoning initiative to harness LLMs for instruction-following robotic navigation. Such a trend underscores the potential of LLMs to generalize navigational reasoning and diverse language understanding. However, a significant discrepancy in agent performance is observed when integrating LLMs into Vision-and-Language Navigation (VLN) tasks compared to previous downstream specialist models. Furthermore, the inherent capacity of language to interpret and facilitate communication in agent interactions is often underutilized in these integrations. In this work, we strive to bridge the divide between VLN-specialized models and LLM-based navigation paradigms, while maintaining the interpretative prowess of LLMs in generating linguistic navigational reasoning. By aligning visual content with a frozen LLM, we equip LLMs with visual observation comprehension and exploit a way to combine LLMs and navigation policy networks for effective action prediction and navigational reasoning. We demonstrate the data efficiency of the proposed methods and eliminate the gap between LLM-based agents and state-of-the-art VLN specialists.
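A minimal PyTorch sketch of the general recipe, freezing the (vision-)language backbone and training only a lightweight policy head that scores candidate directions, is given below; the attention-based scoring head and all dimensions are illustrative assumptions, not NavGPT-2's actual architecture.

```python
# Minimal sketch: keep the language backbone frozen and train only a small
# policy head that scores candidate navigation actions from its hidden states.
import torch
import torch.nn as nn


class FrozenBackboneWithPolicyHead(nn.Module):
    def __init__(self, backbone, d_model=768):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # freeze the LLM/VLM backbone
            p.requires_grad_(False)
        # Trainable head: attend from a learned query to candidate features.
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, candidate_feats):
        # candidate_feats: (batch, num_candidates, d_model) features the
        # frozen backbone produced for each navigable direction.
        with torch.no_grad():
            feats = self.backbone(candidate_feats)
        q = self.query.expand(feats.size(0), -1, -1)
        ctx, _ = self.attn(q, feats, feats)            # (B, 1, D) pooled context
        logits = self.scorer(feats + ctx).squeeze(-1)  # (B, num_candidates)
        return logits                                  # argmax -> next action


if __name__ == "__main__":
    frozen = nn.Identity()  # stand-in for a frozen feature extractor
    agent = FrozenBackboneWithPolicyHead(frozen)
    print(agent(torch.randn(2, 5, 768)).shape)  # torch.Size([2, 5])
```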
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
Zhang, Jiazhao, Wang, Kunyu, Xu, Rongtao, Zhou, Gengze, Hong, Yicong, Fang, Xiaomeng, Wu, Qi, Zhang, Zhizheng, Wang, He
Vision-and-Language Navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, whether to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision-language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavor to showcase the capability of VLMs to achieve state-of-the-art navigation performance without any maps, odometers, or depth inputs. Following human instructions, NaVid requires only an on-the-fly video stream from a monocular RGB camera mounted on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally avoids the problems introduced by odometry noise and the Sim2Real gaps from map or depth inputs. Moreover, our video-based approach can effectively encode the historical observations of robots as spatio-temporal context for decision making and instruction following. We train NaVid with 510k navigation samples collected from continuous environments, including action-planning and instruction-reasoning samples, along with 763k large-scale web data samples. Extensive experiments show that NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step not only for navigation agents but also for this research field.
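The control loop implied by this formulation can be sketched as follows: accumulate the monocular RGB frames seen so far, query a video-capable VLM together with the instruction, and parse the reply into a discrete action. `query_video_vlm` and the action set are hypothetical stand-ins, not NaVid's real interface.

```python
# Minimal sketch of a video-only navigation loop: no map, odometry, or depth.
from typing import Callable, List

ACTIONS = {"FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"}


def navigate(instruction: str,
             get_frame: Callable[[], object],
             query_video_vlm: Callable[[list, str], str],
             max_steps: int = 50) -> List[str]:
    """Run the agent until it emits STOP or the step budget is exhausted."""
    history: list = []   # spatio-temporal context = raw frame history
    executed: List[str] = []
    for _ in range(max_steps):
        history.append(get_frame())                    # on-the-fly video stream
        reply = query_video_vlm(history, instruction)  # e.g. "TURN_LEFT"
        action = reply.strip().upper()
        if action not in ACTIONS:
            action = "STOP"                            # fall back on unparsable output
        executed.append(action)
        if action == "STOP":
            break
    return executed


if __name__ == "__main__":
    # Dummy camera and model so the sketch runs end-to-end.
    fake_camera = lambda: "frame"
    fake_vlm = lambda frames, instr: "STOP" if len(frames) >= 3 else "FORWARD"
    print(navigate("go to the kitchen and stop", fake_camera, fake_vlm))
```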
NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
Zhou, Gengze, Hong, Yicong, Wu, Qi
Trained with an unprecedented scale of data, large language models (LLMs) such as ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities from model scaling. Such a trend underscores the potential of training LLMs with unlimited language data, advancing the development of a universal embodied agent. In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status, and makes the decision to approach the target. Through comprehensive experiments, we demonstrate that NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from observations and actions along a path, as well as drawing accurate top-down metric trajectories given the agent's navigation history. Although the performance of NavGPT on zero-shot R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs for LLMs to serve as visual navigation agents and applying the explicit reasoning of LLMs to benefit learning-based models.
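A minimal sketch of such a zero-shot decision loop is shown below: each step serializes the instruction, history, observation descriptions, and navigable directions into a prompt, asks an LLM to reason and choose, and parses the selected viewpoint. The prompt template and `call_llm` are hypothetical stand-ins, not NavGPT's exact prompting scheme.

```python
# Minimal sketch of one zero-shot navigation step driven by an LLM.
import re
from typing import Callable, Dict, List

PROMPT = """You are navigating a building by following an instruction.
Instruction: {instruction}
History so far: {history}
Current observation: {observation}
Navigable directions:
{candidates}
Think step by step, then answer with 'Action: <id>' or 'Action: STOP'."""


def step(instruction: str,
         history: List[str],
         observation: str,
         candidates: Dict[str, str],
         call_llm: Callable[[str], str]) -> str:
    """Build the prompt, query the LLM, and parse the chosen viewpoint id."""
    cand_text = "\n".join(f"  {cid}: {desc}" for cid, desc in candidates.items())
    prompt = PROMPT.format(instruction=instruction,
                           history="; ".join(history) or "none",
                           observation=observation,
                           candidates=cand_text)
    reply = call_llm(prompt)
    match = re.search(r"Action:\s*(\S+)", reply)
    choice = match.group(1) if match else "STOP"
    # Refuse anything that is not a known candidate or an explicit stop.
    return choice if choice in candidates or choice == "STOP" else "STOP"


if __name__ == "__main__":
    fake_llm = lambda p: "The bedroom is to the left.\nAction: c2"
    print(step("walk into the bedroom and stop",
               [], "a hallway with doors on both sides",
               {"c1": "kitchen ahead", "c2": "bedroom to the left"},
               fake_llm))  # -> c2
```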