OpenVLN: Open-world Aerial Vision-Language Navigation

Lin, Peican, Sun, Gan, Liu, Chenxi, Li, Fazeng, Ren, Weihong, Cong, Yang

arXiv.org Artificial Intelligence 

Abstract -- Vision-language models (VLMs) have been widely applied in ground-based vision-language navigation (VLN). However, the vast complexity of outdoor aerial environments compounds data-acquisition challenges and imposes long-horizon trajectory-planning requirements on Unmanned Aerial Vehicles (UAVs), introducing novel complexities for aerial VLN. To address these challenges, we propose a data-efficient Open-world aerial Vision-Language Navigation (i.e., OpenVLN) framework, which can execute language-guided flight under limited-data constraints and enhance long-horizon trajectory-planning capabilities in complex aerial environments. Concurrently, we introduce a long-horizon planner for trajectory synthesis that dynamically generates precise UAV actions via value-based rewards. To this end, we conduct extensive navigation experiments on the TravelUAV benchmark with dataset scaling across diverse reward settings. Our method demonstrates consistent performance gains of up to 4.34% in Success Rate, 6.19% in Oracle Success Rate, and 4.07% in Success weighted by Path Length over baseline methods, validating its deployment efficacy for long-horizon UAV navigation in complex aerial environments.

I. INTRODUCTION

Vision-language navigation (VLN) [1] is a cornerstone task for embodied agents: it demands that agents traverse intricate, real-world environments solely by following natural-language instructions.
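The abstract describes a planner that selects UAV actions via value-based rewards. The following is a minimal, hypothetical sketch of that idea only: greedily choosing the candidate action whose estimated value is highest. All names (`Action`, `value_reward`, `select_action`) and the toy distance-based value function are illustrative assumptions, not OpenVLN's actual implementation.

```python
# Illustrative sketch of value-based action selection (an assumption,
# not the paper's method): score each candidate action with a value
# function and greedily take the best one.
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    dx: float  # hypothetical displacement along x
    dy: float  # hypothetical displacement along y

def value_reward(pos, action, goal):
    """Toy value: negative distance to goal after taking the action."""
    nx, ny = pos[0] + action.dx, pos[1] + action.dy
    return -((goal[0] - nx) ** 2 + (goal[1] - ny) ** 2) ** 0.5

def select_action(pos, goal, actions):
    """Pick the action with the highest value-based reward."""
    return max(actions, key=lambda a: value_reward(pos, a, goal))

candidates = [Action("forward", 1.0, 0.0),
              Action("left", 0.0, 1.0),
              Action("right", 0.0, -1.0)]
best = select_action((0.0, 0.0), (5.0, 0.0), candidates)
print(best.name)  # "forward": it reduces distance to the goal most
```

In a real long-horizon planner the value would come from a learned model rather than a hand-written distance heuristic, but the greedy-over-values selection loop has the same shape.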