VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation

Zhao, Han; Zhang, Jiaxuan; Song, Wenxuan; Ding, Pengxiang; Wang, Donglin

Oct-17-2025, arXiv.org Artificial Intelligence

Abstract-- Current vision-language-action (VLA) models, pre-trained on large-scale robotic data, exhibit strong multi-task capabilities and generalize well to variations in visual input and language instructions for manipulation. However, their success rate drops significantly when they face object concepts outside the training data, such as object descriptions and textures not seen in the dataset. Building on the LIBERO simulation environment, we introduce novel objects and object descriptions to construct a new evaluation benchmark with three difficulty levels for testing the effectiveness of our method. Our framework outperforms the current state-of-the-art models on the hard level of this generalization benchmark.

I. INTRODUCTION

In recent years, foundation models have profoundly influenced artificial intelligence research. In robotics, Vision-Language-Action (VLA) models [10]-[16], built upon vision-language models, represent a prominent research paradigm. This approach harnesses the learning capacity of large-scale models and shows strong potential to serve as a foundational backbone for future general-purpose robots performing manipulation tasks in open-world environments.

In evaluations involving unseen concepts (i.e., object textures and language descriptions outside the dataset), our proposed framework surpasses other state-of-the-art models fine-tuned on the original LIBERO dataset. In contrast, the reproduced Agentic Robot framework [17] using our model exhibits a noticeable performance degradation on this task.

Some researchers have attempted to train jointly on robotic manipulation data and web-scale multimodal data [10], [14], aiming to preserve extensive conceptual knowledge during training and thereby enhance generalization in manipulation tasks.
