UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, Jianyu Chen
Recent advances in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve generalization. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, limiting their ability to capture detailed spatial information and to understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce UP-VLA, a Unified VLA model trained with both multi-modal Understanding and future Prediction objectives, enhancing both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33% improvement on the CALVIN ABC-D benchmark over the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.
arXiv.org Artificial Intelligence
Feb-2-2025
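The training paradigm described in the abstract, a single model supervised by a multi-modal understanding objective and a future-prediction objective alongside action learning, can be pictured as a multi-task loss. The PyTorch snippet below is a minimal sketch under that assumption; the module names (`UnifiedVLASketch`, `joint_loss`), feature dimensions, toy layers, and equal loss weights are illustrative and are not taken from the paper.

```python
# Minimal, hypothetical sketch (not the authors' code) of the idea in the abstract:
# one backbone optimized jointly on a vision-language understanding objective,
# a future visual-prediction objective, and an action objective.
# All module names, feature sizes, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedVLASketch(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, vocab=1000, action_dim=7):
        super().__init__()
        self.vision_proj = nn.Linear(feat_dim, hidden)   # stand-in for a vision encoder
        self.text_embed = nn.Embedding(vocab, hidden)    # stand-in for a language model
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)
        self.understand_head = nn.Linear(hidden, vocab)  # token logits for understanding queries
        self.predict_head = nn.Linear(hidden, feat_dim)  # predicts features of a future frame
        self.action_head = nn.Linear(hidden, action_dim) # outputs a low-level control command

    def forward(self, img_feat, text_ids):
        # Fuse visual and textual tokens, then pool a shared representation.
        tokens = torch.cat([self.vision_proj(img_feat), self.text_embed(text_ids)], dim=1)
        h, _ = self.backbone(tokens)
        pooled = h.mean(dim=1)
        return self.understand_head(pooled), self.predict_head(pooled), self.action_head(pooled)

def joint_loss(model, batch, w_und=1.0, w_pred=1.0, w_act=1.0):
    """Weighted sum of understanding, future-prediction, and action objectives."""
    und_logits, future_pred, action_pred = model(batch["img_feat"], batch["text_ids"])
    l_und = F.cross_entropy(und_logits, batch["answer_id"])   # multi-modal understanding
    l_pred = F.mse_loss(future_pred, batch["future_feat"])    # future visual prediction
    l_act = F.mse_loss(action_pred, batch["action"])          # action imitation
    return w_und * l_und + w_pred * l_pred + w_act * l_act
```

In the actual system the heads would sit on top of a pre-trained VLM backbone and a visual prediction decoder rather than the toy layers used here; the snippet only illustrates how the three objectives can share one representation.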
- Genre:
  - Research Report
  - New Finding (0.34)
  - Promising Solution (0.34)
- Technology:
  - Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (0.46)
  - Natural Language > Text Processing (0.48)
  - Representation & Reasoning
  - Agents (0.42)
  - Spatial Reasoning (0.68)
  - Vision (1.00)