Vision-Language Navigation




OpenVLN: Open-world Aerial Vision-Language Navigation

Lin, Peican, Sun, Gan, Liu, Chenxi, Li, Fazeng, Ren, Weihong, Cong, Yang

arXiv.org Artificial Intelligence

Vision-language models (VLMs) have been widely applied in ground-based vision-language navigation (VLN). However, the vast complexity of outdoor aerial environments compounds data acquisition challenges and imposes long-horizon trajectory planning requirements on Unmanned Aerial Vehicles (UAVs), introducing novel complexities for aerial VLN. To address these challenges, we propose a data-efficient Open-world aerial Vision-Language Navigation (i.e., OpenVLN) framework, which can execute language-guided flight under limited data constraints and enhance long-horizon trajectory planning capabilities in complex aerial environments. Concurrently, we introduce a long-horizon planner for trajectory synthesis that dynamically generates precise UAV actions via value-based rewards. To this end, we conduct extensive navigation experiments on the TravelUAV benchmark with dataset scaling across diverse reward settings. Our method demonstrates consistent performance gains of up to 4.34% in Success Rate, 6.19% in Oracle Success Rate, and 4.07% in Success weighted by Path Length over baseline methods, validating its deployment efficacy for long-horizon UAV navigation in complex aerial environments. I. INTRODUCTION: Vision-language navigation (VLN) [1] is a cornerstone task for embodied agents: it demands that agents traverse intricate, real-world environments solely by following natural-language instructions.
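For readers who want a concrete picture of value-based waypoint scoring in this spirit, the following minimal Python sketch ranks candidate UAV waypoints with a hand-written reward that trades goal progress against drift from an instruction-implied corridor; the reward terms, weights, and function names are illustrative assumptions, not the OpenVLN planner itself.

# Illustrative sketch (not the authors' implementation): score candidate UAV
# waypoints with a simple value-based reward combining progress toward a goal
# estimate with a penalty for leaving the instruction-implied corridor.
import numpy as np

def value_reward(waypoint, goal, prev_pos, corridor_center, corridor_radius=15.0):
    # Progress term: how much closer the candidate brings the UAV to the goal.
    progress = np.linalg.norm(goal - prev_pos) - np.linalg.norm(goal - waypoint)
    # Corridor term: penalize drifting away from the language-implied path.
    drift = max(0.0, np.linalg.norm(waypoint - corridor_center) - corridor_radius)
    return progress - 0.5 * drift

def plan_next_waypoint(candidates, goal, prev_pos, corridor_center):
    # Greedy one-step lookahead over sampled candidate waypoints.
    scores = [value_reward(np.asarray(c), goal, prev_pos, corridor_center)
              for c in candidates]
    return candidates[int(np.argmax(scores))]

if __name__ == "__main__":
    goal = np.array([120.0, 40.0, 30.0])
    prev = np.array([0.0, 0.0, 25.0])
    corridor = np.array([60.0, 20.0, 28.0])
    candidates = [prev + np.random.uniform(-10, 10, 3) + np.array([8.0, 3.0, 0.0])
                  for _ in range(16)]
    print(plan_next_waypoint(candidates, goal, prev, corridor))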




UAV-VLN: End-to-End Vision Language guided Navigation for UAVs

Saxena, Pranav, Raghuvanshi, Nishant, Goveas, Neena

arXiv.org Artificial Intelligence

A core challenge in AI-guided autonomy is enabling agents to navigate realistically and effectively in previously unseen environments based on natural language commands. We propose UAV-VLN, a novel end-to-end Vision-Language Navigation (VLN) framework for Unmanned Aerial Vehicles (UAVs) that seamlessly integrates Large Language Models (LLMs) with visual perception to facilitate human-interactive navigation. Our system interprets free-form natural language instructions, grounds them into visual observations, and plans feasible aerial trajectories in diverse environments. UAV-VLN leverages the common-sense reasoning capabilities of LLMs to parse high-level semantic goals, while a vision model detects and localizes semantically relevant objects in the environment. By fusing these modalities, the UAV can reason about spatial relationships, disambiguate references in human instructions, and plan context-aware behaviors with minimal task-specific supervision. To ensure robust and interpretable decision-making, the framework includes a cross-modal grounding mechanism that aligns linguistic intent with visual context. We evaluate UAV-VLN across diverse indoor and outdoor navigation scenarios, demonstrating its ability to generalize to novel instructions and environments with minimal task-specific training. Our results show significant improvements in instruction-following accuracy and trajectory efficiency, highlighting the potential of LLM-driven vision-language interfaces for safe, intuitive, and generalizable UAV autonomy.
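As a rough illustration of the LLM-plus-detector pattern described above, the sketch below splits an instruction into landmark phrases and grounds each phrase to the best-matching detection; parse_with_llm, ground_subgoal, and the Detection structure are hypothetical stand-ins, not the UAV-VLN components.

# Minimal sketch of an LLM + open-vocabulary detector pipeline in the spirit of
# the abstract; the LLM call is stubbed with a trivial phrase splitter.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    bbox: tuple          # (x1, y1, x2, y2) in image pixels
    confidence: float

def parse_with_llm(instruction: str) -> list[str]:
    # Placeholder for an LLM call that extracts ordered landmark phrases
    # from a free-form instruction.
    parts = instruction.lower().replace("then", ",").split(",")
    return [p.strip() for p in parts if p.strip()]

def ground_subgoal(subgoal: str, detections: list[Detection]) -> Detection | None:
    # Pick the highest-confidence detection whose label appears in the sub-goal.
    matches = [d for d in detections if d.label in subgoal]
    return max(matches, key=lambda d: d.confidence, default=None)

if __name__ == "__main__":
    dets = [Detection("house", (40, 60, 120, 140), 0.91),
            Detection("fountain", (300, 200, 360, 260), 0.74)]
    for sg in parse_with_llm("Fly past the red house, then land near the fountain"):
        print(sg, "->", ground_subgoal(sg, dets))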


View Invariant Learning for Vision-Language Navigation in Continuous Environments

Sun, Josh Qixuan, Xing, Xiaoying, Weng, Huaiyuan, Yeum, Chul Min, Crowley, Mark

arXiv.org Artificial Intelligence

Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most navigation policies are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle that alter the agent's observation. In this paper, we introduce a generalized scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. Additionally, we introduce a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components, thus eliminating the cost for individual module training. Empirical results show that our method outperforms state-of-the-art approaches on V2-VLNCE by 8-15% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Furthermore, we evaluate VIL under the standard VLNCE setting and find that, despite being trained for varied viewpoints, it often still improves performance. On the more challenging RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics when compared to other map-free methods. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method.
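A minimal sketch of the two training signals mentioned above, assuming a PyTorch setup: an InfoNCE-style contrastive loss over features of the same state seen from two viewpoints, and an MSE distillation loss from a frozen view-dependent teacher waypoint predictor to the student; the shapes, temperature, and loss weight here are illustrative, not the paper's settings.

# Hedged sketch of view-invariant contrastive learning plus teacher-student
# waypoint distillation; batch sizes and dimensions are arbitrary examples.
import torch
import torch.nn.functional as F

def view_contrastive_loss(feat_view_a, feat_view_b, temperature=0.1):
    # feat_view_*: (batch, dim) features of the same states under two viewpoints.
    a = F.normalize(feat_view_a, dim=-1)
    b = F.normalize(feat_view_b, dim=-1)
    logits = a @ b.t() / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)       # positives lie on the diagonal

def waypoint_distillation_loss(student_waypoints, teacher_waypoints):
    # Match the student's predicted waypoints to the frozen teacher's outputs.
    return F.mse_loss(student_waypoints, teacher_waypoints.detach())

if __name__ == "__main__":
    fa, fb = torch.randn(8, 256), torch.randn(8, 256)
    s_wp, t_wp = torch.randn(8, 2), torch.randn(8, 2)
    loss = view_contrastive_loss(fa, fb) + 0.5 * waypoint_distillation_loss(s_wp, t_wp)
    print(loss.item())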


Vision-Language Navigation with Energy-Based Policy

Neural Information Processing Systems

Vision-language navigation (VLN) requires an agent to execute actions following human instructions. Existing VLN models are optimized from expert demonstrations through supervised behavioural cloning or by incorporating manual reward engineering. While straightforward, these efforts overlook the accumulation of errors in the Markov decision process, and struggle to match the distribution of the expert policy. Going beyond this, we propose an Energy-based Navigation Policy (ENP) to model the joint state-action distribution using an energy-based model. At each step, low energy values correspond to the state-action pairs that the expert is most likely to perform, and vice versa.
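To make the energy-based formulation concrete, here is a hedged PyTorch sketch that assumes a discrete action space: a network assigns an energy E(s, a) to each state-action pair, the policy is a softmax over negative energies, and training minimizes the negative log-likelihood of expert actions; the architecture and dimensions are placeholders rather than the ENP model.

# Illustrative energy-based policy sketch: low energy => high action probability.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnergyPolicy(nn.Module):
    def __init__(self, state_dim=128, num_actions=6):
        super().__init__()
        self.energy_head = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, num_actions))

    def energies(self, state):
        return self.energy_head(state)            # (batch, num_actions)

    def action_probs(self, state):
        # pi(a|s) proportional to exp(-E(s, a)).
        return F.softmax(-self.energies(state), dim=-1)

    def nll_loss(self, state, expert_action):
        # Negative log-likelihood of the expert action under the induced policy.
        return F.cross_entropy(-self.energies(state), expert_action)

if __name__ == "__main__":
    policy = EnergyPolicy()
    s = torch.randn(4, 128)
    a = torch.tensor([0, 2, 5, 1])
    print(policy.action_probs(s).shape, policy.nll_loss(s, a).item())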


LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs

Zhang, Xinyuan, Tian, Yonglin, Lin, Fei, Liu, Yue, Ma, Jing, Szatmáry, Kornélia Sára, Wang, Fei-Yue

arXiv.org Artificial Intelligence

The growing demand for intelligent logistics, particularly fine-grained terminal delivery, underscores the need for autonomous UAV (Unmanned Aerial Vehicle)-based delivery systems. However, most existing last-mile delivery studies rely on ground robots, while current UAV-based Vision-Language Navigation (VLN) tasks primarily focus on coarse-grained, long-range goals, making them unsuitable for precise terminal delivery. To bridge this gap, we propose LogisticsVLN, a scalable aerial delivery system built on multimodal large language models (MLLMs) for autonomous terminal delivery. LogisticsVLN integrates lightweight Large Language Models (LLMs) and Visual-Language Models (VLMs) in a modular pipeline for request understanding, floor localization, object detection, and action-decision making. To support research and evaluation in this new setting, we construct the Vision-Language Delivery (VLD) dataset within the CARLA simulator. In addition, we conduct subtask-level evaluations of each module of our system, offering valuable insights for improving the robustness and real-world deployment of foundation model-based vision-language delivery systems. INTRODUCTION: Driven by the rapid growth of e-commerce and urbanization, logistics has become an increasingly critical component of modern society [1]. In particular, there is a growing demand for stable, efficient, and user-centric terminal delivery.
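The skeletal Python sketch below mirrors the modular pipeline named in the abstract (request understanding, floor localization, object detection, action decision), with every stage stubbed out; the module names, data classes, and return values are hypothetical placeholders, not the released LogisticsVLN interfaces.

# Skeleton of a modular VLM/LLM delivery pipeline; each stage is a stub that
# a real system would back with an LLM, a VLM, and a detector respectively.
from dataclasses import dataclass

@dataclass
class DeliveryRequest:
    target_floor: int
    target_object: str   # e.g. "window with a potted plant"

def understand_request(text: str) -> DeliveryRequest:
    # Placeholder for the LLM-based request-understanding stage.
    return DeliveryRequest(target_floor=3, target_object="window with potted plant")

def localize_floor(image, request: DeliveryRequest) -> bool:
    # Placeholder for VLM-based floor localization from the facade view.
    return True

def detect_target(image, request: DeliveryRequest):
    # Placeholder for open-vocabulary detection of the delivery target.
    return {"bbox": (120, 80, 200, 160), "score": 0.82}

def decide_action(detection) -> str:
    # Trivial rule standing in for the action-decision stage.
    return "approach" if detection and detection["score"] > 0.5 else "search"

def delivery_step(image, user_text: str) -> str:
    request = understand_request(user_text)
    if not localize_floor(image, request):
        return "ascend"
    return decide_action(detect_target(image, request))

if __name__ == "__main__":
    print(delivery_step(image=None,
                        user_text="Deliver to the 3rd-floor window with a potted plant"))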


Constraint-Aware Zero-Shot Vision-Language Navigation in Continuous Environments

Chen, Kehan, An, Dong, Huang, Yan, Xu, Rongtao, Su, Yifei, Ling, Yonggen, Reid, Ian, Wang, Liang

arXiv.org Artificial Intelligence

We address the task of Vision-Language Navigation in Continuous Environments (VLN-CE) under the zero-shot setting. Zero-shot VLN-CE is particularly challenging due to the absence of expert demonstrations for training and minimal environment structural prior to guide navigation. To confront these challenges, we propose a Constraint-Aware Navigator (CA-Nav), which reframes zero-shot VLN-CE as a sequential, constraint-aware sub-instruction completion process. CA-Nav continuously translates sub-instructions into navigation plans using two core modules: the Constraint-Aware Sub-instruction Manager (CSM) and the Constraint-Aware Value Mapper (CVM). CSM defines the completion criteria for decomposed sub-instructions as constraints and tracks navigation progress by switching sub-instructions in a constraint-aware manner. CVM, guided by CSM's constraints, generates a value map on the fly and refines it using superpixel clustering to improve navigation stability. CA-Nav achieves the state-of-the-art performance on two VLN-CE benchmarks, surpassing the previous best method by 12 percent and 13 percent in Success Rate on the validation unseen splits of R2R-CE and RxR-CE, respectively. Moreover, CA-Nav demonstrates its effectiveness in real-world robot deployments across various indoor scenes and instructions.
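One way to picture the constraint-aware switching idea is the toy sketch below: each sub-instruction carries a completion predicate, and a manager advances only when the current predicate holds on the latest observation; the predicates and observation fields are invented for illustration and are not CSM's actual criteria.

# Toy constraint-aware sub-instruction manager: switch to the next
# sub-instruction only once the current completion predicate is satisfied.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubInstruction:
    text: str
    constraint: Callable[[dict], bool]   # maps current observation -> satisfied?

class ConstraintAwareManager:
    def __init__(self, sub_instructions):
        self.subs = sub_instructions
        self.idx = 0

    def current(self):
        return self.subs[self.idx].text if self.idx < len(self.subs) else None

    def update(self, observation: dict):
        # Advance past every sub-instruction whose constraint already holds.
        while self.idx < len(self.subs) and self.subs[self.idx].constraint(observation):
            self.idx += 1
        return self.current()

if __name__ == "__main__":
    manager = ConstraintAwareManager([
        SubInstruction("exit the bedroom", lambda obs: obs.get("room") != "bedroom"),
        SubInstruction("walk to the sofa", lambda obs: obs.get("dist_to_sofa", 1e9) < 1.0),
    ])
    print(manager.current())
    print(manager.update({"room": "hallway", "dist_to_sofa": 4.2}))
    print(manager.update({"room": "living room", "dist_to_sofa": 0.6}))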