Goto

Collaborating Authors

 grasping



DexSinGrasp: Learning a Unified Policy for Dexterous Object Singulation and Grasping in Densely Cluttered Environments

Xu, Lixin, Liu, Zixuan, Gui, Zhewei, Guo, Jingxiang, Jiang, Zeyu, Zhang, Tongzhou, Xu, Zhixuan, Gao, Chongkai, Shao, Lin

arXiv.org Artificial Intelligence

Abstract-- Grasping objects in cluttered environments remains a fundamental yet challenging problem in robotic manipulation. While prior works have explored learning-based synergies between pushing and grasping for two-fingered grippers, few have leveraged the high degrees of freedom (DoF) in dexterous hands to perform efficient singulation for grasping in cluttered settings. In this work, we introduce DexSinGrasp, a unified policy for dexterous object singulation and grasping. DexSinGrasp enables high-dexterity object singulation to facilitate grasping, significantly improving efficiency and effectiveness in cluttered environments. We incorporate clutter arrangement curriculum learning to enhance success rates and generalization across diverse clutter conditions, while policy distillation enables a deploy-able vision-based grasping strategy. T o evaluate our approach, we introduce a set of cluttered grasping tasks with varying object arrangements and occlusion levels. Experimental results show that our method outperforms baselines in both efficiency and grasping success rate, particularly in dense clutter . Dexterous grasping of target objects in cluttered environments is crucial for various applications, from production lines [1] to assembly processes [2], [3] and beyond.


Hoecken-D Hand: A Novel Robotic Hand for Linear Parallel Pinching and Self-Adaptive Grasping

Guo, Wentao, Zhang, Wenzeng

arXiv.org Artificial Intelligence

This paper presents the Hoecken-D Hand, an underactuated robotic gripper that combines a modified Hoecken linkage with a differential spring mechanism to achieve both linear parallel pinching and a mid-stroke transition to adaptive envelope. The original Hoecken linkage is reconfigured by replacing one member with differential links, preserving straight-line guidance while enabling contact-triggered reconfiguration without additional actuators. A double-parallelogram arrangement maintains fingertip parallelism during conventional pinching, whereas the differential mechanism allows one finger to wrap inward upon encountering an obstacle, improving stability on irregular or thin objects. The mechanism can be driven by a single linear actuator, minimizing complexity and cost; in our prototype, each finger is driven by its own linear actuator for simplicity. We perform kinematic modeling and force analysis to characterize grasp performance, including simulated grasping forces and spring-opening behavior under varying geometric parameters. The design was prototyped using PLA-based 3D printing, achieving a linear pinching span of approximately 200 mm. Preliminary tests demonstrate reliable grasping in both modes across a wide range of object geometries, highlighting the Hoecken-D Hand as a compact, adaptable, and cost-effective solution for manipulation in unstructured environments.


A Novel Robot Hand with Hoeckens Linkages and Soft Phalanges for Scooping and Self-Adaptive Grasping in Environmental Constraints

Guo, Wentao, Wang, Yizhou, Zhang, Wenzeng

arXiv.org Artificial Intelligence

This paper presents a novel underactuated adaptive robotic hand, Hockens-A Hand, which integrates the Hoeckens mechanism, a double-parallelogram linkage, and a specialized four-bar linkage to achieve three adaptive grasping modes: parallel pinching, asymmetric scooping, and enveloping grasping. Hockens-A Hand requires only a single linear actuator, leveraging passive mechanical intelligence to ensure adaptability and compliance in unstructured environments. Specifically, the vertical motion of the Hoeckens mechanism introduces compliance, the double-parallelogram linkage ensures line contact at the fingertip, and the four-bar amplification system enables natural transitions between different grasping modes. Additionally, the inclusion of a mesh-textured silicone phalanx further enhances the ability to envelop objects of various shapes and sizes. This study employs detailed kinematic analysis to optimize the push angle and design the linkage lengths for optimal performance. Simulations validated the design by analyzing the fingertip motion and ensuring smooth transitions between grasping modes. Furthermore, the grasping force was analyzed using power equations to enhance the understanding of the system's performance.Experimental validation using a 3D-printed prototype demonstrates the three grasping modes of the hand in various scenarios under environmental constraints, verifying its grasping stability and broad applicability.



Whole-Body Coordination for Dynamic Object Grasping with Legged Manipulators

Liang, Qiwei, Cai, Boyang, He, Rongyi, Li, Hui, Teng, Tao, Duan, Haihan, Huang, Changxin, Zeng, Runhao

arXiv.org Artificial Intelligence

Quadrupedal robots with manipulators offer strong mobility and adaptability for grasping in unstructured, dynamic environments through coordinated whole-body control. However, existing research has predominantly focused on static-object grasping, neglecting the challenges posed by dynamic targets and thus limiting applicability in dynamic scenarios such as logistics sorting and human-robot collaboration. To address this, we introduce DQ-Bench, a new benchmark that systematically evaluates dynamic grasping across varying object motions, velocities, heights, object types, and terrain complexities, along with comprehensive evaluation metrics. Building upon this benchmark, we propose DQ-Net, a compact teacher-student framework designed to infer grasp configurations from limited perceptual cues. During training, the teacher network leverages privileged information to holistically model both the static geometric properties and dynamic motion characteristics of the target, and integrates a grasp fusion module to deliver robust guidance for motion planning. Concurrently, we design a lightweight student network that performs dual-viewpoint temporal modeling using only the target mask, depth map, and proprioceptive state, enabling closed-loop action outputs without reliance on privileged data. Extensive experiments on DQ-Bench demonstrate that DQ-Net achieves robust dynamic objects grasping across multiple task settings, substantially outperforming baseline methods in both success rate and responsiveness.


Grasp-HGN: Grasping the Unexpected

Zandigohar, Mehrshad, Dasari, Mallesham, Schirner, Gunar

arXiv.org Artificial Intelligence

For transradial amputees, robotic prosthetic hands promise to regain the capability to perform daily living activities. To advance next-generation prosthetic hand control design, it is crucial to address current shortcomings in robustness to out of lab artifacts, and generalizability to new environments. Due to the fixed number of object to interact with in existing datasets, contrasted with the virtually infinite variety of objects encountered in the real world, current grasp models perform poorly on unseen objects, negatively affecting users' independence and quality of life. To address this: (i) we define semantic projection, the ability of a model to generalize to unseen object types and show that conventional models like YOLO, despite 80% training accuracy, drop to 15% on unseen objects. (ii) we propose Grasp-LLaVA, a Grasp Vision Language Model enabling human-like reasoning to infer the suitable grasp type estimate based on the object's physical characteristics resulting in a significant 50.2% accuracy over unseen object types compared to 36.7% accuracy of an SOTA grasp estimation model. Lastly, to bridge the performance-latency gap, we propose Hybrid Grasp Network (HGN), an edge-cloud deployment infrastructure enabling fast grasp estimation on edge and accurate cloud inference as a fail-safe, effectively expanding the latency vs. accuracy Pareto. HGN with confidence calibration (DC) enables dynamic switching between edge and cloud models, improving semantic projection accuracy by 5.6% (to 42.3%) with 3.5x speedup over the unseen object types. Over a real-world sample mix, it reaches 86% average accuracy (12.2% gain over edge-only), and 2.2x faster inference than Grasp-LLaVA alone.


Foundation Model-Driven Grasping of Unknown Objects via Center of Gravity Estimation

Xiangli, Kang, He, Yage, Gong, Xianwu, Liu, Zehan, Bai, Yuru

arXiv.org Artificial Intelligence

This study presents a grasping method for objects with uneven mass distribution by leveraging diffusion models to localize the center of gravity (CoG) on unknown objects. In robotic grasping, CoG deviation often leads to postural instability, where existing keypoint-based or affordance-driven methods exhibit limitations. We constructed a dataset of 790 images featuring unevenly distributed objects with keypoint annotations for CoG localization. A vision-driven framework based on foundation models was developed to achieve CoG-aware grasping. Experimental evaluations across real-world scenarios demonstrate that our method achieves a 49\% higher success rate compared to conventional keypoint-based approaches and an 11\% improvement over state-of-the-art affordance-driven methods. The system exhibits strong generalization with a 76\% CoG localization accuracy on unseen objects, providing a novel solution for precise and stable grasping tasks.


Consensus-Driven Uncertainty for Robotic Grasping based on RGB Perception

Joyce, Eric C., Zhao, Qianwen, Burgdorfer, Nathaniel, Wang, Long, Mordohai, Philippos

arXiv.org Artificial Intelligence

--Deep object pose estimators are notoriously overconfident. A grasping agent that both estimates the 6-DoF pose of a target object and predicts the uncertainty of its own estimate could avoid task failure by choosing not to act under high uncertainty. Even though object pose estimation improves and uncertainty quantification research continues to make strides, few studies have connected them to the downstream task of robotic grasping. We propose a method for training lightweight, deep networks to predict whether a grasp guided by an image-based pose estimate will succeed before that grasp is attempted. We generate training data for our networks via object pose estimation on real images and simulated grasping. We also find that, despite high object variability in grasping trials, networks benefit from training on all objects jointly, suggesting that a diverse variety of objects can nevertheless contribute to the same goal. Remarkable progress in object pose estimation from single RGB images has been made in the past few years [1]-[4], primarily driven by deep learning and the ability to reduce the so-called sim2real gap . This has enabled end-to-end system training on large amounts of synthetic data with precise ground truth. Consider for example the pose estimates illustrated in Figure 1. These were made by current methods, yet all four caused grasping attempts to fail when used as guides. Motivated by this disconnect between pose evaluation and success in downstream grasping, we propose an approach to estimate the likelihood for success before a grasp is actually attempted.


PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation

Yin, Yifan, Han, Zhengtao, Aarya, Shivam, Wang, Jianxin, Xu, Shuhang, Peng, Jiawei, Wang, Angtian, Yuille, Alan, Shu, Tianmin

arXiv.org Artificial Intelligence

Fine-grained robot manipulation, such as lifting and rotating a bottle to display the label on the cap, requires robust reasoning about object parts and their relationships with intended tasks. Despite recent advances in training general-purpose robot manipulation policies guided by language instructions, there is a notable lack of large-scale datasets for fine-grained manipulation tasks with part-level instructions and diverse 3D object instances annotated with part-level labels. In this work, we introduce PartInstruct, the first large-scale benchmark for training and evaluating fine-grained robot manipulation models using part-level instructions. PartInstruct comprises 513 object instances across 14 categories, each annotated with part-level information, and 1302 fine-grained manipulation tasks organized into 16 task classes. Our training set consists of over 10,000 expert demonstrations synthesized in a 3D simulator, where each demonstration is paired with a high-level task instruction, a chain of base part-based skill instructions, and ground-truth 3D information about the object and its parts. Additionally, we designed a comprehensive test suite to evaluate the generalizability of learned policies across new states, objects, and tasks. We evaluated several state-of-the-art robot manipulation approaches, including end-to-end vision-language policy learning and bi-level planning models for robot manipulation on our benchmark. The experimental results reveal that current models struggle to robustly ground part concepts and predict actions in 3D space, and face challenges when manipulating object parts in long-horizon tasks.