VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation
Mateo Guaman Castro, Sidharth Rajagopal, Daniel Gorbatov, Matt Schmittle, Rohan Baijal, Octi Zhang, Rosario Scalise, Sidharth Talia, Emma Romig, Celso de Melo, Byron Boots, Abhishek Gupta
–arXiv.org Artificial Intelligence
Abstract -- A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). Our key idea is to decouple semantic planning from embodiment grounding. We achieve this by training a high-level VLM planner on diverse, heterogeneous real-world data; the planner proposes trajectory candidates as 2D paths, which are then re-ranked by an embodiment-specific affordance model trained cheaply and safely in simulation. We enable this separation by carefully designing an interface that lets the high-level planner propose candidate paths directly in image space, which the affordance model then evaluates and re-ranks. We also show that our hierarchical design enables cross-embodiment navigation across legged and wheeled robots and is easily steerable using natural language. Real-world ablations confirm that the specialist affordance model is key to embodiment grounding, enabling a single high-level planner to be deployed across physically distinct wheeled and legged robots. Finally, this model significantly enhances single-robot reliability, achieving 3x higher success rates by rejecting physically infeasible plans.

A core problem in robotics is determining how robots can navigate to a goal location while traversing non-trivial terrain and obstacles. The promise of general-purpose robot navigation -- performing well across diverse environments and embodiments while remaining easy to steer during deployment -- has motivated a shift from hand-designed modular stacks to learning-based approaches that leverage large-scale data. Recent advances in robotic foundation models have shown that performance scales with the amount of diverse data provided [1], [2], [3], [4]. However, as datasets scale, so does their heterogeneity.
This becomes a critical challenge when a downstream robot is physically incapable of achieving the entirety of behaviors recorded in a pooled, multi-robot dataset. For instance, data from a quadruped navigating stairs is of limited use to a wheeled robot. This creates a bottleneck that prevents us from naively combining all available data and achieving reliable navigation performance.
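The propose-then-re-rank interface described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names (`rerank_candidates`, `wheeled_affordance`, the threshold value) are hypothetical, and the toy affordance model simply penalizes steep image-space paths as a stand-in for terrain a wheeled robot cannot traverse.

```python
# Hypothetical sketch of the hierarchical interface: a high-level planner
# proposes candidate 2D image-space paths; an embodiment-specific affordance
# model scores them, rejects infeasible ones, and re-ranks the rest.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Path2D = List[Tuple[float, float]]  # pixel waypoints (u, v) in image space


@dataclass
class RankedPlan:
    path: Path2D
    score: float  # traversability score in [0, 1] for this embodiment


def rerank_candidates(
    candidates: List[Path2D],
    affordance_fn: Callable[[Path2D], float],
    feasibility_threshold: float = 0.5,  # illustrative cutoff
) -> List[RankedPlan]:
    """Score each candidate path with the embodiment-specific affordance
    model, drop physically infeasible ones, and return the rest best-first."""
    ranked = [RankedPlan(p, affordance_fn(p)) for p in candidates]
    feasible = [r for r in ranked if r.score >= feasibility_threshold]
    return sorted(feasible, key=lambda r: r.score, reverse=True)


# Toy affordance model for a wheeled robot: paths with large vertical jumps
# between consecutive waypoints (a crude proxy for stairs) score poorly.
def wheeled_affordance(path: Path2D) -> float:
    max_step = max(abs(b[1] - a[1]) for a, b in zip(path, path[1:]))
    return 1.0 if max_step <= 10.0 else 0.1


flat = [(0.0, 100.0), (5.0, 98.0), (10.0, 97.0)]
stairs = [(0.0, 100.0), (5.0, 70.0), (10.0, 40.0)]
plans = rerank_candidates([stairs, flat], wheeled_affordance)
```

With a quadruped's affordance model substituted for `wheeled_affordance`, the same planner output would be re-ranked differently, which is the sense in which one high-level planner can serve physically distinct embodiments.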
Oct-24-2025