GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model
Ali Abouzeid, Malak Mansour, Zezhou Sun, Dezhen Song
arXiv.org Artificial Intelligence
Vision-Language-Action (VLA) models often fail to generalize to novel camera viewpoints, a limitation stemming from their difficulty in inferring robust 3D geometry from 2D images. We introduce GeoAware-VLA, a simple yet effective approach that enhances viewpoint invariance by integrating strong geometric priors into the vision backbone. Instead of training a visual encoder or relying on explicit 3D data, we leverage a frozen, pretrained geometric vision model as a feature extractor. A trainable projection layer then adapts these geometrically-rich features for the policy decoder, relieving it of the burden of learning 3D consistency from scratch. Through extensive evaluations on LIBERO benchmark subsets, we show GeoAware-VLA achieves substantial improvements in zero-shot generalization to novel camera poses, boosting success rates by over 2x in simulation. Crucially, these benefits translate to the physical world; our model shows a significant performance gain on a real robot, especially when evaluated from unseen camera angles. Our approach proves effective across both continuous and discrete action spaces, highlighting that robust geometric grounding is a key component for creating more generalizable robotic agents.
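The abstract's core design can be sketched as a frozen feature extractor followed by a small trainable adapter. The snippet below is a minimal, hypothetical PyTorch illustration: the class name `GeoAwareVisualFront` and the stand-in MLP backbone are assumptions for demonstration; the paper itself uses a pretrained geometric vision model as the frozen backbone.

```python
import torch
import torch.nn as nn

class GeoAwareVisualFront(nn.Module):
    """Sketch of the visual pathway described in the abstract:
    a frozen, geometry-aware backbone feeds a trainable projection
    layer that adapts features for the downstream policy decoder."""

    def __init__(self, backbone: nn.Module, feat_dim: int, policy_dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep the geometric priors fixed
        # The projection layer is the only trainable visual component.
        self.proj = nn.Linear(feat_dim, policy_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # no gradients flow into the frozen backbone
            feats = self.backbone(images)
        return self.proj(feats)

# Placeholder backbone standing in for a pretrained geometric vision model.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
front = GeoAwareVisualFront(backbone, feat_dim=256, policy_dim=128)
tokens = front(torch.randn(4, 3, 32, 32))
trainable = [n for n, p in front.named_parameters() if p.requires_grad]
```

Only `proj.weight` and `proj.bias` receive gradients, which matches the abstract's claim that the approach avoids training a visual encoder while still adapting geometrically rich features for the policy decoder.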
Nov-10-2025
- Country:
  - Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Genre:
  - Research Report (0.50)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning
      - Neural Networks (0.68)
      - Statistical Learning (0.46)
    - Natural Language (1.00)
    - Robots (1.00)