General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting

Lange, Bernard; Yildiz, Anil; Arief, Mansur; Khattak, Shehryar; Kochenderfer, Mykel; Georgakis, Georgios

arXiv.org Artificial Intelligence 

Abstract-- Developing general-purpose navigation policies for unknown environments remains a core challenge in robotics. Most existing systems rely on task-specific neural networks and fixed information flows, limiting their generalizability. Large Vision-Language Models (LVLMs) offer a promising alternative by embedding human-like knowledge for reasoning and planning, but prior LVLM-robot integrations have largely depended on pre-mapped spaces, hard-coded representations, and rigid control logic. We introduce the Agentic Robotic Navigation Architecture (ARNA), a general-purpose framework that equips an LVLM-based agent with a library of perception, reasoning, and navigation tools drawn from modern robotic stacks. At runtime, the agent autonomously defines and executes task-specific workflows that iteratively query modules, reason over multimodal inputs, and select navigation actions. This agentic formulation enables robust navigation and reasoning in previously unmapped environments, offering a new perspective on robotic stack design. Evaluated in Habitat Lab on the HM-EQA benchmark, ARNA outperforms state-of-the-art EQA-specific approaches. Qualitative results on RxR and custom tasks further demonstrate its ability to generalize across a broad range of navigation challenges.

Developing general-purpose navigation robots that can accomplish diverse tasks in unknown environments from natural language instructions remains a core challenge in robotics.