Vision-Language Navigation




OpenVLN: Open-world Aerial Vision-Language Navigation

Lin, Peican, Sun, Gan, Liu, Chenxi, Li, Fazeng, Ren, Weihong, Cong, Yang

arXiv.org Artificial Intelligence

Vision-language models (VLMs) have been widely applied in ground-based vision-language navigation (VLN). However, the vast complexity of outdoor aerial environments compounds data acquisition challenges and imposes long-horizon trajectory planning requirements on Unmanned Aerial Vehicles (UAVs), introducing novel complexities for aerial VLN. To address these challenges, we propose a data-efficient Open-world aerial Vision-Language Navigation (i.e., OpenVLN) framework, which can execute language-guided flight under limited data constraints and enhance long-horizon trajectory planning capabilities in complex aerial environments. Concurrently, we introduce a long-horizon planner for trajectory synthesis that dynamically generates precise UAV actions via value-based rewards. To this end, we conduct extensive navigation experiments on the TravelUAV benchmark with dataset scaling across diverse reward settings. Our method demonstrates consistent performance gains of up to 4.34% in Success Rate, 6.19% in Oracle Success Rate, and 4.07% in Success weighted by Path Length over baseline methods, validating its deployment efficacy for long-horizon UAV navigation in complex aerial environments. I. INTRODUCTION: Vision-language navigation (VLN) [1] is a cornerstone task for embodied agents: it demands that agents traverse intricate, real-world environments solely by following natural-language instructions.
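For readers who want a concrete picture of value-based waypoint scoring in this spirit, the following minimal Python sketch ranks candidate UAV waypoints with a hand-written reward that trades goal progress against drift from an instruction-implied corridor; the reward terms, weights, and function names are illustrative assumptions, not the OpenVLN planner itself.

# Illustrative sketch (not the authors' implementation): score candidate UAV
# waypoints with a simple value-based reward combining progress toward a goal
# estimate with a penalty for leaving the instruction-implied corridor.
import numpy as np

def value_reward(waypoint, goal, prev_pos, corridor_center, corridor_radius=15.0):
    # Progress term: how much closer the candidate brings the UAV to the goal.
    progress = np.linalg.norm(goal - prev_pos) - np.linalg.norm(goal - waypoint)
    # Corridor term: penalize drifting away from the language-implied path.
    drift = max(0.0, np.linalg.norm(waypoint - corridor_center) - corridor_radius)
    return progress - 0.5 * drift

def plan_next_waypoint(candidates, goal, prev_pos, corridor_center):
    # Greedy one-step lookahead over sampled candidate waypoints.
    scores = [value_reward(np.asarray(c), goal, prev_pos, corridor_center)
              for c in candidates]
    return candidates[int(np.argmax(scores))]

if __name__ == "__main__":
    goal = np.array([120.0, 40.0, 30.0])
    prev = np.array([0.0, 0.0, 25.0])
    corridor = np.array([60.0, 20.0, 28.0])
    candidates = [prev + np.random.uniform(-10, 10, 3) + np.array([8.0, 3.0, 0.0])
                  for _ in range(16)]
    print(plan_next_waypoint(candidates, goal, prev, corridor))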




UAV-VLN: End-to-End Vision Language guided Navigation for UAVs

Saxena, Pranav, Raghuvanshi, Nishant, Goveas, Neena

arXiv.org Artificial Intelligence

A core challenge in AI-guided autonomy is enabling agents to navigate realistically and effectively in previously unseen environments based on natural language commands. We propose UAV-VLN, a novel end-to-end Vision-Language Navigation (VLN) framework for Unmanned Aerial Vehicles (UAVs) that seamlessly integrates Large Language Models (LLMs) with visual perception to facilitate human-interactive navigation. Our system interprets free-form natural language instructions, grounds them into visual observations, and plans feasible aerial trajectories in diverse environments. UAV-VLN leverages the common-sense reasoning capabilities of LLMs to parse high-level semantic goals, while a vision model detects and localizes semantically relevant objects in the environment. By fusing these modalities, the UAV can reason about spatial relationships, disambiguate references in human instructions, and plan context-aware behaviors with minimal task-specific supervision. To ensure robust and interpretable decision-making, the framework includes a cross-modal grounding mechanism that aligns linguistic intent with visual context. We evaluate UAV-VLN across diverse indoor and outdoor navigation scenarios, demonstrating its ability to generalize to novel instructions and environments with minimal task-specific training. Our results show significant improvements in instruction-following accuracy and trajectory efficiency, highlighting the potential of LLM-driven vision-language interfaces for safe, intuitive, and generalizable UAV autonomy.
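As a rough illustration of the LLM-plus-detector pattern described above, the sketch below splits an instruction into landmark phrases and grounds each phrase to the best-matching detection; parse_with_llm, ground_subgoal, and the Detection structure are hypothetical stand-ins, not the UAV-VLN components.

# Minimal sketch of an LLM + open-vocabulary detector pipeline in the spirit of
# the abstract; the LLM call is stubbed with a trivial phrase splitter.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    bbox: tuple          # (x1, y1, x2, y2) in image pixels
    confidence: float

def parse_with_llm(instruction: str) -> list[str]:
    # Placeholder for an LLM call that extracts ordered landmark phrases
    # from a free-form instruction.
    parts = instruction.lower().replace("then", ",").split(",")
    return [p.strip() for p in parts if p.strip()]

def ground_subgoal(subgoal: str, detections: list[Detection]) -> Detection | None:
    # Pick the highest-confidence detection whose label appears in the sub-goal.
    matches = [d for d in detections if d.label in subgoal]
    return max(matches, key=lambda d: d.confidence, default=None)

if __name__ == "__main__":
    dets = [Detection("house", (40, 60, 120, 140), 0.91),
            Detection("fountain", (300, 200, 360, 260), 0.74)]
    for sg in parse_with_llm("Fly past the red house, then land near the fountain"):
        print(sg, "->", ground_subgoal(sg, dets))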


View Invariant Learning for Vision-Language Navigation in Continuous Environments

Sun, Josh Qixuan, Xing, Xiaoying, Weng, Huaiyuan, Yeum, Chul Min, Crowley, Mark

arXiv.org Artificial Intelligence

Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most navigation policies are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle that alter the agent's observation. In this paper, we introduce a generalized scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. Additionally, we introduce a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components, thus eliminating the cost for individual module training. Empirical results show that our method outperforms state-of-the-art approaches on V2-VLNCE by 8-15% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Furthermore, we evaluate VIL under the standard VLNCE setting and find that, despite being trained for varied viewpoints, it often still improves performance. On the more challenging RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics when compared to other map-free methods. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method.
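A minimal sketch of the two training signals mentioned above, assuming a PyTorch setup: an InfoNCE-style contrastive loss over features of the same state seen from two viewpoints, and an MSE distillation loss from a frozen view-dependent teacher waypoint predictor to the student; the shapes, temperature, and loss weight here are illustrative, not the paper's settings.

# Hedged sketch of view-invariant contrastive learning plus teacher-student
# waypoint distillation; batch sizes and dimensions are arbitrary examples.
import torch
import torch.nn.functional as F

def view_contrastive_loss(feat_view_a, feat_view_b, temperature=0.1):
    # feat_view_*: (batch, dim) features of the same states under two viewpoints.
    a = F.normalize(feat_view_a, dim=-1)
    b = F.normalize(feat_view_b, dim=-1)
    logits = a @ b.t() / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)       # positives lie on the diagonal

def waypoint_distillation_loss(student_waypoints, teacher_waypoints):
    # Match the student's predicted waypoints to the frozen teacher's outputs.
    return F.mse_loss(student_waypoints, teacher_waypoints.detach())

if __name__ == "__main__":
    fa, fb = torch.randn(8, 256), torch.randn(8, 256)
    s_wp, t_wp = torch.randn(8, 2), torch.randn(8, 2)
    loss = view_contrastive_loss(fa, fb) + 0.5 * waypoint_distillation_loss(s_wp, t_wp)
    print(loss.item())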


Vision-Language Navigation with Energy-Based Policy

Neural Information Processing Systems

Vision-language navigation (VLN) requires an agent to execute actions following human instructions. Existing VLN models are optimized from expert demonstrations through supervised behavioural cloning or by incorporating manual reward engineering. While straightforward, these efforts overlook the accumulation of errors in the Markov decision process, and struggle to match the distribution of the expert policy. Going beyond this, we propose an Energy-based Navigation Policy (ENP) to model the joint state-action distribution using an energy-based model. At each step, low energy values correspond to the state-action pairs that the expert is most likely to perform, and vice versa.
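To make the energy-based formulation concrete, here is a hedged PyTorch sketch that assumes a discrete action space: a network assigns an energy E(s, a) to each state-action pair, the policy is a softmax over negative energies, and training minimizes the negative log-likelihood of expert actions; the architecture and dimensions are placeholders rather than the ENP model.

# Illustrative energy-based policy sketch: low energy => high action probability.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnergyPolicy(nn.Module):
    def __init__(self, state_dim=128, num_actions=6):
        super().__init__()
        self.energy_head = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, num_actions))

    def energies(self, state):
        return self.energy_head(state)            # (batch, num_actions)

    def action_probs(self, state):
        # pi(a|s) proportional to exp(-E(s, a)).
        return F.softmax(-self.energies(state), dim=-1)

    def nll_loss(self, state, expert_action):
        # Negative log-likelihood of the expert action under the induced policy.
        return F.cross_entropy(-self.energies(state), expert_action)

if __name__ == "__main__":
    policy = EnergyPolicy()
    s = torch.randn(4, 128)
    a = torch.tensor([0, 2, 5, 1])
    print(policy.action_probs(s).shape, policy.nll_loss(s, a).item())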


LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs

Zhang, Xinyuan, Tian, Yonglin, Lin, Fei, Liu, Yue, Ma, Jing, Szatmáry, Kornélia Sára, Wang, Fei-Yue

arXiv.org Artificial Intelligence

The growing demand for intelligent logistics, particularly fine-grained terminal delivery, underscores the need for autonomous UAV (Unmanned Aerial Vehicle)-based delivery systems. However, most existing last-mile delivery studies rely on ground robots, while current UAV-based Vision-Language Navigation (VLN) tasks primarily focus on coarse-grained, long-range goals, making them unsuitable for precise terminal delivery. To bridge this gap, we propose LogisticsVLN, a scalable aerial delivery system built on multimodal large language models (MLLMs) for autonomous terminal delivery. LogisticsVLN integrates lightweight Large Language Models (LLMs) and Visual-Language Models (VLMs) in a modular pipeline for request understanding, floor localization, object detection, and action-decision making. To support research and evaluation in this new setting, we construct the Vision-Language Delivery (VLD) dataset within the CARLA simulator. In addition, we conduct subtask-level evaluations of each module of our system, offering valuable insights for improving the robustness and real-world deployment of foundation model-based vision-language delivery systems. INTRODUCTION: Driven by the rapid growth of e-commerce and urbanization, logistics has become an increasingly critical component of modern society [1]. In particular, there is a growing demand for stable, efficient, and user-centric terminal delivery.
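The skeletal Python sketch below mirrors the modular pipeline named in the abstract (request understanding, floor localization, object detection, action decision), with every stage stubbed out; the module names, data classes, and return values are hypothetical placeholders, not the released LogisticsVLN interfaces.

# Skeleton of a modular VLM/LLM delivery pipeline; each stage is a stub that
# a real system would back with an LLM, a VLM, and a detector respectively.
from dataclasses import dataclass

@dataclass
class DeliveryRequest:
    target_floor: int
    target_object: str   # e.g. "window with a potted plant"

def understand_request(text: str) -> DeliveryRequest:
    # Placeholder for the LLM-based request-understanding stage.
    return DeliveryRequest(target_floor=3, target_object="window with potted plant")

def localize_floor(image, request: DeliveryRequest) -> bool:
    # Placeholder for VLM-based floor localization from the facade view.
    return True

def detect_target(image, request: DeliveryRequest):
    # Placeholder for open-vocabulary detection of the delivery target.
    return {"bbox": (120, 80, 200, 160), "score": 0.82}

def decide_action(detection) -> str:
    # Trivial rule standing in for the action-decision stage.
    return "approach" if detection and detection["score"] > 0.5 else "search"

def delivery_step(image, user_text: str) -> str:
    request = understand_request(user_text)
    if not localize_floor(image, request):
        return "ascend"
    return decide_action(detect_target(image, request))

if __name__ == "__main__":
    print(delivery_step(image=None,
                        user_text="Deliver to the 3rd-floor window with a potted plant"))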


Constraint-Aware Zero-Shot Vision-Language Navigation in Continuous Environments

Chen, Kehan, An, Dong, Huang, Yan, Xu, Rongtao, Su, Yifei, Ling, Yonggen, Reid, Ian, Wang, Liang

arXiv.org Artificial Intelligence

We address the task of Vision-Language Navigation in Continuous Environments (VLN-CE) under the zero-shot setting. Zero-shot VLN-CE is particularly challenging due to the absence of expert demonstrations for training and minimal environment structural prior to guide navigation. To confront these challenges, we propose a Constraint-Aware Navigator (CA-Nav), which reframes zero-shot VLN-CE as a sequential, constraint-aware sub-instruction completion process. CA-Nav continuously translates sub-instructions into navigation plans using two core modules: the Constraint-Aware Sub-instruction Manager (CSM) and the Constraint-Aware Value Mapper (CVM). CSM defines the completion criteria for decomposed sub-instructions as constraints and tracks navigation progress by switching sub-instructions in a constraint-aware manner. CVM, guided by CSM's constraints, generates a value map on the fly and refines it using superpixel clustering to improve navigation stability. CA-Nav achieves the state-of-the-art performance on two VLN-CE benchmarks, surpassing the previous best method by 12 percent and 13 percent in Success Rate on the validation unseen splits of R2R-CE and RxR-CE, respectively. Moreover, CA-Nav demonstrates its effectiveness in real-world robot deployments across various indoor scenes and instructions.
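One way to picture the constraint-aware switching idea is the toy sketch below: each sub-instruction carries a completion predicate, and a manager advances only when the current predicate holds on the latest observation; the predicates and observation fields are invented for illustration and are not CSM's actual criteria.

# Toy constraint-aware sub-instruction manager: switch to the next
# sub-instruction only once the current completion predicate is satisfied.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubInstruction:
    text: str
    constraint: Callable[[dict], bool]   # maps current observation -> satisfied?

class ConstraintAwareManager:
    def __init__(self, sub_instructions):
        self.subs = sub_instructions
        self.idx = 0

    def current(self):
        return self.subs[self.idx].text if self.idx < len(self.subs) else None

    def update(self, observation: dict):
        # Advance past every sub-instruction whose constraint already holds.
        while self.idx < len(self.subs) and self.subs[self.idx].constraint(observation):
            self.idx += 1
        return self.current()

if __name__ == "__main__":
    manager = ConstraintAwareManager([
        SubInstruction("exit the bedroom", lambda obs: obs.get("room") != "bedroom"),
        SubInstruction("walk to the sofa", lambda obs: obs.get("dist_to_sofa", 1e9) < 1.0),
    ])
    print(manager.current())
    print(manager.update({"room": "hallway", "dist_to_sofa": 4.2}))
    print(manager.update({"room": "living room", "dist_to_sofa": 0.6}))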