UniDiffGrasp: A Unified Framework Integrating VLM Reasoning and VLM-Guided Part Diffusion for Open-Vocabulary Constrained Grasping with Dual Arms

Guo, Xueyang, Hu, Hongwei, Song, Chengye, Chen, Jiale, Zhao, Zilin, Fu, Yu, Guan, Bowen, Liu, Zhenze

May-13-2025–arXiv.org Artificial Intelligence

Abstract: Open-vocabulary, task-oriented grasping of specific functional parts, particularly with dual arms, remains a key challenge. Current Vision-Language Models (VLMs) enhance task understanding but often struggle with precise grasp generation within defined constraints and with effective dual-arm coordination. We propose UniDiffGrasp, a unified framework integrating VLM reasoning with guided part diffusion to address these limitations. UniDiffGrasp leverages a VLM to interpret user input and identify semantic targets (object, part(s), mode), which are then grounded via open-vocabulary segmentation. Critically, the identified parts directly provide geometric constraints for a Constrained Grasp Diffusion Field (CGDF) using its Part-Guided Diffusion, enabling efficient, high-quality 6-DoF grasps without retraining. For dual-arm tasks, UniDiffGrasp defines distinct target regions, applies part-guided diffusion per arm, and selects stable cooperative grasps. In extensive real-world deployment, UniDiffGrasp achieves grasp success rates of 0.876 in single-arm and 0.767 in dual-arm scenarios, significantly surpassing existing state-of-the-art methods and demonstrating precise, coordinated open-vocabulary grasping in complex real-world scenarios.

I. INTRODUCTION

The ambition for robots to seamlessly integrate into human environments as capable assistants hinges on their ability to perform dexterous, task-oriented manipulation.

large language model, machine learning, natural language

Country:
- Asia > China (0.15)

Genre:
- Research Report (0.70)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Robots (1.00)
  - Machine Learning (1.00)
  - Representation & Reasoning > Constraint-Based Reasoning (0.49)
  - Natural Language > Large Language Model (0.48)
