DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping
Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Yitao Liang, Yaodong Yang, Yuanpei Chen
Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. However, existing research typically relies on specific assumptions, such as single-object settings or limited environments, leading to constrained generalization. Our solution is DexGraspVLA, a hierarchical framework that uses a pre-trained vision-language model as the high-level task planner and learns a diffusion-based policy as the low-level action controller. The key insight lies in iteratively transforming diverse language and visual inputs into domain-invariant representations, where imitation learning can be applied effectively because domain shift is alleviated. This enables robust generalization across a wide range of real-world scenarios. Notably, our method achieves a 90+% success rate under thousands of unseen object, lighting, and background combinations in a "zero-shot" environment. Empirical analysis further confirms the consistency of internal model behavior across environmental variations, thereby validating our design and explaining its generalization performance. We hope our work can be a step forward in achieving general dexterous grasping. Our demo and code can be found at https://dexgraspvla.github.io/.
arXiv.org Artificial Intelligence
Mar-5-2025
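For concreteness, the hierarchical flow the abstract describes (a vision-language planner on top, a diffusion-based action controller below, joined by domain-invariant features) can be sketched roughly as follows. This is a minimal illustrative sketch, not the released DexGraspVLA implementation: the class names (VLMPlanner, FeatureExtractor, DiffusionController), interfaces, dimensions, and the stubbed-out model internals are all assumptions.

```python
# Minimal sketch of the hierarchical planner/controller loop described in the
# abstract. All class names and interfaces here are hypothetical placeholders,
# not the released DexGraspVLA API.
from dataclasses import dataclass
import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray          # camera image
    proprio: np.ndarray      # robot joint state


class VLMPlanner:
    """High-level planner: maps a language instruction and an image to a grasp
    target (here, a 2D bounding box), standing in for the pre-trained
    vision-language model used as the task planner."""

    def propose_target(self, instruction: str, obs: Observation) -> np.ndarray:
        # Placeholder: a real planner would ground the instruction in the image.
        return np.array([0.4, 0.3, 0.6, 0.5])  # normalized [x0, y0, x1, y1]


class FeatureExtractor:
    """Produces domain-invariant features from the observation and the target,
    the representation in which the abstract argues imitation learning
    generalizes."""

    def __call__(self, obs: Observation, target: np.ndarray) -> np.ndarray:
        return np.concatenate([target, obs.proprio])


class DiffusionController:
    """Low-level controller: a diffusion-based policy that denoises an action
    chunk conditioned on the domain-invariant features."""

    def __init__(self, action_dim: int = 13, horizon: int = 8, steps: int = 10):
        self.action_dim, self.horizon, self.steps = action_dim, horizon, steps

    def sample_actions(self, features: np.ndarray) -> np.ndarray:
        # Placeholder reverse-diffusion loop; a trained model would predict the
        # noise/score at each step instead of this random stand-in.
        actions = np.random.randn(self.horizon, self.action_dim)
        for _ in range(self.steps):
            actions = actions - 0.1 * actions  # stand-in for a denoising step
        return actions


def grasp_episode(instruction: str, obs: Observation) -> np.ndarray:
    planner = VLMPlanner()
    encoder = FeatureExtractor()
    controller = DiffusionController()
    target = planner.propose_target(instruction, obs)   # high-level plan
    features = encoder(obs, target)                     # domain-invariant features
    return controller.sample_actions(features)          # low-level action chunk


if __name__ == "__main__":
    obs = Observation(rgb=np.zeros((480, 640, 3), dtype=np.uint8),
                      proprio=np.zeros(13))
    print(grasp_episode("pick up the red mug", obs).shape)  # (8, 13)
```

The point of the intermediate representation in this layout is that the low-level controller only ever conditions on features intended to look the same across objects, lighting, and backgrounds, which is the abstract's argument for why imitation learning transfers to unseen environments.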
- Country:
  - Asia > China (0.14)
  - North America > United States (0.14)
- Genre:
  - Research Report (0.82)
  - Workflow (0.67)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Reinforcement Learning (0.68)
    - Robots > Manipulation (1.00)
    - Vision (1.00)