Li, Puhao
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations
Li, Puhao, Liu, Tengyu, Li, Yuyang, Han, Muzhi, Geng, Haoran, Wang, Shu, Zhu, Yixin, Zhu, Song-Chun, Huang, Siyuan
Autonomous robotic systems capable of learning novel manipulation tasks are poised to transform industries from manufacturing to service automation. However, modern methods (e.g., VIP and R3M) still face significant hurdles, notably the domain gap among robotic embodiments and the sparsity of successful task executions within specific action spaces, resulting in misaligned and ambiguous task representations. We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting these challenges through two key innovations: a novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability; and an agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object. Ag2Manip's empirical validation across simulated benchmarks like FrankaKitchen, ManiSkill, and PartManip shows a 325% increase in performance, achieved without domain-specific demonstrations. Ablation studies underline the essential contributions of the visual and action representations to this success. Extending our evaluations to the real world, Ag2Manip significantly improves imitation learning success rates from 50% to 77.5%, demonstrating its effectiveness and generalizability across both simulated and physical environments.
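The abstract above describes rewarding a policy by how closely the current observation matches a goal image in an embodiment-masked embedding space. Below is a minimal sketch of that idea, assuming a placeholder encoder and a cosine-distance progress reward; the class and function names are illustrative assumptions, not Ag2Manip's released code.

# Hypothetical sketch: an agent-agnostic visual embedding used as a
# zero-shot reward signal for manipulation RL. The encoder stands in for a
# representation pretrained on human videos with the embodiment masked out.
import torch
import torch.nn as nn

class AgentAgnosticEncoder(nn.Module):
    """Placeholder CNN; in practice this would be a pretrained backbone."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, H, W), with the robot embodiment already masked/inpainted
        return self.backbone(frames)

def embedding_reward(encoder: nn.Module,
                     prev_frame: torch.Tensor,
                     curr_frame: torch.Tensor,
                     goal_frame: torch.Tensor) -> torch.Tensor:
    """Dense reward = progress toward the goal image in embedding space,
    i.e. d(goal, prev) - d(goal, curr) with d a cosine distance."""
    with torch.no_grad():
        z_prev, z_curr, z_goal = (encoder(x) for x in (prev_frame, curr_frame, goal_frame))
    d_prev = 1.0 - torch.cosine_similarity(z_prev, z_goal, dim=-1)
    d_curr = 1.0 - torch.cosine_similarity(z_curr, z_goal, dim=-1)
    return d_prev - d_curr  # positive when the scene now looks closer to the goal

# Usage: reward = embedding_reward(enc, frame_prev, frame_curr, frame_goal)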
An Embodied Generalist Agent in 3D World
Huang, Jiangyong, Yong, Silong, Ma, Xiaojian, Linghu, Xiongkun, Li, Puhao, Wang, Yan, Li, Qing, Zhu, Song-Chun, Jia, Baoxiong, Huang, Siyuan
Leveraging massive knowledge and learning schemes from large language models (LLMs), recent machine learning models show notable successes in building generalist agents capable of general-purpose task solving in diverse domains, including natural language processing, computer vision, and robotics. However, a significant challenge remains as these models exhibit limited ability in understanding and interacting with the 3D world. We argue this limitation significantly hinders the current models from performing real-world tasks and further achieving general intelligence. To this end, we introduce an embodied multi-modal and multi-task generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. Our proposed agent, referred to as LEO, is trained with shared LLM-based model architectures, objectives, and weights in two stages: (i) 3D vision-language alignment and (ii) 3D vision-language-action instruction tuning. To facilitate the training, we meticulously curate and generate an extensive dataset comprising object-level and scene-level multi-modal tasks of exceptional scale and complexity, necessitating a deep understanding of and interaction with the 3D world. Through rigorous experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, embodied navigation, and robotic manipulation. Our ablation results further provide valuable insights for the development of future embodied generalist agents.
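As a rough illustration of the shared LLM-based interface described above, the sketch below projects per-object 3D features into a language model's token-embedding space and prepends them to the text sequence. Module names, dimensions, and the prefix design are assumptions for illustration, not LEO's actual architecture.

# Hypothetical sketch of a prefix-style multimodal interface for an
# embodied 3D generalist agent built on an LLM backbone.
import torch
import torch.nn as nn

class Scene3DPrefix(nn.Module):
    def __init__(self, obj_feat_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Linear projector from a 3D object encoder (e.g., a point-cloud
        # backbone) into the LLM token-embedding space.
        self.projector = nn.Linear(obj_feat_dim, llm_dim)

    def forward(self, obj_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        """obj_feats: (B, num_objects, obj_feat_dim) per-object 3D features.
        text_embeds: (B, seq_len, llm_dim) embedded instruction tokens.
        Returns one sequence the LLM can attend over."""
        scene_tokens = self.projector(obj_feats)      # (B, N, llm_dim)
        return torch.cat([scene_tokens, text_embeds], dim=1)

# Under this sketch, stage (i) would align scene tokens with language via
# captioning-style objectives, while stage (ii) would tune the same weights
# on instruction-following data that additionally predicts action tokens.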
Grasp Multiple Objects with One Hand
Li, Yuyang, Liu, Bo, Geng, Yiran, Li, Puhao, Yang, Yaodong, Zhu, Yixin, Liu, Tengyu, Huang, Siyuan
Key contributions include: (i) a dataset tailored for multi-object grasping research; (ii) the development of the first Goal-Conditioned Reinforcement Learning (GCRL) policy for concurrent grasping and lifting of multiple objects from a table; (iii) the enhancement of the execution policy for better adaptability to unseen object configurations and imprecise pre-grasp poses, achieved via specialist distillation and curriculum learning; and (iv) a comprehensive framework, MultiGrasp, that extends existing robotic systems toward robust, accurate multi-object grasping. The work aims to maintain individual object maneuverability while boosting grasp efficiency. Because robots often operate in complex physical environments where noisy sensory input makes analytical solutions challenging, RL is commonly used for decision-making and control in these cases [4, 5, 16, 40, 41]; as a specialized form, GCRL [42] focuses on skill acquisition for predefined objectives.
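To make the goal-conditioned formulation above concrete, here is a minimal sketch of a goal-conditioned actor that maps a state and a desired multi-object grasp goal to joint targets. The network shape, input layout, and the surrounding training recipe are hypothetical, not the MultiGrasp implementation.

# Hypothetical sketch of a goal-conditioned actor for multi-object grasping.
import torch
import torch.nn as nn

class GoalConditionedActor(nn.Module):
    def __init__(self, obs_dim: int, goal_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # normalized joint targets
        )

    def forward(self, obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # obs: proprioception plus object states; goal: desired grasp
        # (e.g., target hand pose and object poses), concatenated as input.
        return self.net(torch.cat([obs, goal], dim=-1))

# Specialist distillation and curriculum learning, as mentioned above, would
# wrap this: specialists are trained on narrow object configurations, and a
# single student actor is regressed onto their actions.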
DexGraspNet: A Large-Scale Robotic Dexterous Grasp Dataset for General Objects Based on Simulation
Wang, Ruicheng, Zhang, Jialiang, Chen, Jiayi, Xu, Yinzhen, Li, Puhao, Liu, Tengyu, Wang, He
Robotic dexterous grasping is the first step toward human-like dexterous object manipulation and thus a crucial robotic technology. However, dexterous grasping remains much more under-explored than object grasping with parallel grippers, partially due to the lack of a large-scale dataset. In this work, we present a large-scale robotic dexterous grasp dataset, DexGraspNet, generated by our proposed highly efficient synthesis method that can be generally applied to any dexterous hand. Our method leverages a deeply accelerated differentiable force closure estimator and thus can efficiently and robustly synthesize stable and diverse grasps on a large scale. We choose ShadowHand and generate 1.32 million grasps for 5355 objects, covering more than 133 object categories and containing more than 200 diverse grasps for each object instance, with all grasps validated by the Isaac Gym simulator. Compared to the previous dataset from Liu et al. generated by GraspIt!, our dataset has not only more objects and grasps but also higher diversity and quality. Through cross-dataset experiments, we show that training several dexterous grasp synthesis algorithms on our dataset significantly outperforms training on the previous one. To access our data and code, including code for human and Allegro grasp synthesis, please visit our project page: https://pku-epic.github.io/DexGraspNet/.
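The synthesis method above rests on a differentiable force-closure estimator that can be minimized at scale. The sketch below uses a simplified surrogate, penalizing the net 6D wrench produced by unit inward contact normals, purely to illustrate the differentiable-energy idea; it is not DexGraspNet's exact estimator, and the toy usage at the end is hypothetical.

# Hedged sketch: a simplified, differentiable force-closure surrogate.
import torch

def force_closure_energy(contact_points: torch.Tensor,
                         contact_normals: torch.Tensor) -> torch.Tensor:
    """contact_points: (N, 3) contact locations on the object surface.
    contact_normals: (N, 3) inward-pointing unit normals, treated as forces.
    Returns a scalar that is small when the contacts roughly balance out."""
    forces = contact_normals                                         # (N, 3)
    torques = torch.cross(contact_points, contact_normals, dim=-1)   # (N, 3)
    wrench = torch.cat([forces.sum(0), torques.sum(0)])              # net 6D wrench
    return wrench.pow(2).sum()

# Toy usage with random contacts on a unit sphere (inward normals point to
# the center). In an optimization-based pipeline, contacts would instead be
# differentiable functions of hand joint angles via forward kinematics, and
# gradient descent on this energy (plus penetration and joint-limit
# penalties) would refine the grasp.
pts = torch.randn(4, 3)
pts = pts / pts.norm(dim=-1, keepdim=True)
normals = -pts
energy = force_closure_energy(pts, normals)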
GenDexGrasp: Generalizable Dexterous Grasping
Li, Puhao, Liu, Tengyu, Li, Yuyang, Geng, Yiran, Zhu, Yixin, Yang, Yaodong, Huang, Siyuan
Generating dexterous grasps has been a long-standing and challenging robotic task. Despite recent progress, existing methods primarily suffer from two issues. First, most prior work focuses on a specific type of robot hand and lacks the ability to generalize to unseen ones. Second, prior methods often fail to rapidly generate diverse grasps with a high success rate. To jointly tackle these challenges with a unified solution, we propose GenDexGrasp, a novel hand-agnostic algorithm for generalizable grasping. GenDexGrasp is trained on our proposed large-scale multi-hand grasping dataset, MultiDex, synthesized with force closure optimization. By leveraging the contact map as a hand-agnostic intermediate representation, GenDexGrasp efficiently generates diverse and plausible grasping poses with a high success rate and can transfer among diverse multi-fingered robotic hands. Compared with previous methods, GenDexGrasp achieves a three-way trade-off among success rate, inference speed, and diversity. Code is available at https://github.com/tengyu-liu/GenDexGrasp.
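Since the contact map is the hand-agnostic intermediate representation described above, a small sketch can show how a hand-specific stage might fit a hand to a generated map: compare the contact map induced by the current hand pose with the target map and minimize the mismatch. The distance-to-contact mapping, its constants, and the function names are assumptions for illustration, not GenDexGrasp's exact formulation.

# Hedged sketch: fitting a hand to a hand-agnostic target contact map.
import torch

def contact_map_from_distances(dists: torch.Tensor, scale: float = 30.0) -> torch.Tensor:
    """Map object-point-to-hand distances (meters) to [0, 1] contact values;
    closer points get values near 1."""
    return 1.0 - torch.sigmoid(scale * dists - 2.0)

def contact_alignment_loss(hand_surface_pts: torch.Tensor,
                           object_pts: torch.Tensor,
                           target_map: torch.Tensor) -> torch.Tensor:
    """hand_surface_pts: (H, 3) points sampled on the hand under its current pose.
    object_pts: (M, 3) object surface points.
    target_map: (M,) generated hand-agnostic contact map in [0, 1]."""
    # Distance from every object point to its nearest hand point.
    dists = torch.cdist(object_pts, hand_surface_pts).min(dim=-1).values  # (M,)
    current_map = contact_map_from_distances(dists)
    return torch.nn.functional.mse_loss(current_map, target_map)

# For any specific hand, its kinematic model provides hand_surface_pts as a
# differentiable function of joint angles, so the same target_map can be fit
# by hands with different morphologies.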