Zhu, Yuke
Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation
Maddukuri, Abhiram, Jiang, Zhenyu, Chen, Lawrence Yunliang, Nasiriany, Soroush, Xie, Yuqi, Fang, Yu, Huang, Wenqi, Wang, Zu, Xu, Zhenjia, Chernyadev, Nikita, Reed, Scott, Goldberg, Ken, Mandlekar, Ajay, Fan, Linxi, Zhu, Yuke
Large real-world robot datasets hold great potential to train generalist robot models, but scaling real-world human data collection is time-consuming and resource-intensive. Simulation can supplement large-scale data, especially with recent advances in generative AI and automated data generation tools that enable scalable creation of robot behavior datasets. However, training a policy solely in simulation and transferring it to the real world often demands substantial human effort to bridge the reality gap. A compelling alternative is to co-train the policy on a mixture of simulation and real-world datasets. Preliminary studies have recently shown that this strategy substantially improves the performance of a policy over one trained on a limited amount of real-world data. Nonetheless, the community lacks a systematic understanding of sim-and-real co-training and what it takes to reap the benefits of simulation data for real-robot learning. This work presents a simple yet effective recipe for utilizing simulation data to solve vision-based robotic manipulation tasks. We derive this recipe from comprehensive experiments that validate the co-training strategy on various simulation and real-world datasets. Using two domains--a robot arm and a humanoid--across diverse tasks, we demonstrate that simulation data can enhance real-world task performance by an average of 38%, even with notable differences between the simulation and real-world data. Videos and additional results can be found at https://co-training.github.io/
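The co-training recipe boils down to drawing each training batch from a mixture of simulation and real-world demonstrations. The sketch below (not the authors' code) illustrates that sampling step; the mixing ratio `alpha`, the `cotrain_batches` helper, and the toy datasets are hypothetical stand-ins.

```python
import numpy as np

def cotrain_batches(sim_data, real_data, batch_size=64, alpha=0.75, steps=1000, seed=0):
    """Yield mixed batches: roughly `alpha` of each batch comes from simulation,
    the rest from the (much scarcer) real-world demonstrations."""
    rng = np.random.default_rng(seed)
    n_sim = int(round(alpha * batch_size))
    n_real = batch_size - n_sim
    for _ in range(steps):
        sim_idx = rng.integers(0, len(sim_data), size=n_sim)
        real_idx = rng.integers(0, len(real_data), size=n_real)
        yield [sim_data[i] for i in sim_idx] + [real_data[i] for i in real_idx]

# Example: 10k simulated samples vs. only 200 real ones.
sim_data = [("sim_obs", "sim_action")] * 10_000
real_data = [("real_obs", "real_action")] * 200
for batch in cotrain_batches(sim_data, real_data, steps=2):
    print(len(batch), sum(1 for o, _ in batch if o == "real_obs"), "real samples")
```

How heavily to weight simulation against scarce real data is exactly the kind of choice such a recipe has to pin down empirically.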
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
NVIDIA, Bjorck, Johan, Castañeda, Fernando, Cherniadev, Nikita, Da, Xingye, Ding, Runyu, Fan, Linxi "Jim", Fang, Yu, Fox, Dieter, Hu, Fengyuan, Huang, Spencer, Jang, Joel, Jiang, Zhenyu, Kautz, Jan, Kundalia, Kaushil, Lao, Lawrence, Li, Zhiqi, Lin, Zongyu, Lin, Kevin, Liu, Guilin, Llontop, Edith, Magne, Loic, Mandlekar, Ajay, Narayan, Avnish, Nasiriany, Soroush, Reed, Scott, Tan, You Liang, Wang, Guanzhi, Wang, Zu, Wang, Jing, Wang, Qi, Xiang, Jiannan, Xie, Yuqi, Xu, Yinzhen, Xu, Zhenjia, Ye, Seonghyeon, Yu, Zhiding, Zhang, Ao, Zhang, Hao, Zhao, Yizhou, Zheng, Ruijie, Zhu, Yuke
General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.
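A minimal sketch of a dual-system VLA in the spirit described above: a vision-language module produces a conditioning embedding, and a diffusion-style action head learns to denoise chunks of future actions given that conditioning, with both parts trained end-to-end. All module names, dimensions, and the toy noising schedule are assumptions for illustration, not the GR00T N1 implementation.

```python
import torch
import torch.nn as nn

class System2(nn.Module):
    """Stand-in for a vision-language backbone: fuses image and language
    features into one conditioning embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.img_proj = nn.Linear(512, dim)   # e.g. pooled vision features
        self.txt_proj = nn.Linear(512, dim)   # e.g. pooled language features
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())

    def forward(self, img_feat, txt_feat):
        return self.fuse(torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1))

class System1(nn.Module):
    """Stand-in for a diffusion action head: predicts the denoising target
    for a chunk of future actions, conditioned on System 2's embedding."""
    def __init__(self, dim=256, action_dim=24, horizon=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + horizon * action_dim + 1, 512), nn.GELU(),
            nn.Linear(512, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, cond, noisy_actions, t):
        x = torch.cat([cond, noisy_actions.flatten(1), t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

# Joint end-to-end training step (squared error to the injected noise).
sys2, sys1 = System2(), System1()
img, txt = torch.randn(8, 512), torch.randn(8, 512)
actions = torch.randn(8, 16, 24)
noise = torch.randn_like(actions)
t = torch.rand(8, 1)
noisy = actions + t.view(-1, 1, 1) * noise          # toy noising schedule
pred = sys1(sys2(img, txt), noisy, t)
loss = ((pred - noise) ** 2).mean()
loss.backward()
```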
Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids
Lin, Toru, Sachdev, Kartik, Fan, Linxi, Malik, Jitendra, Zhu, Yuke
Reinforcement learning has delivered promising results in achieving human- or even superhuman-level capabilities across diverse problem domains, but success in dexterous robot manipulation remains limited. This work investigates the key challenges in applying reinforcement learning to solve a collection of contact-rich manipulation tasks on a humanoid embodiment. We introduce novel techniques to overcome the identified challenges with empirical validation. Our main contributions include an automated real-to-sim tuning module that brings the simulated environment closer to the real world, a generalized reward design scheme that simplifies reward engineering for long-horizon contact-rich manipulation tasks, a divide-and-conquer distillation process that improves the sample efficiency of hard-exploration problems while maintaining sim-to-real performance, and a mixture of sparse and dense object representations to bridge the sim-to-real perception gap. We show promising results on three humanoid dexterous manipulation tasks, with ablation studies on each technique. Our work presents a successful approach to learning humanoid dexterous manipulation using sim-to-real reinforcement learning, achieving robust generalization and high performance without the need for human demonstration.
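As a rough illustration of the real-to-sim tuning idea, the sketch below searches over simulator physics parameters so that simulated rollouts match a logged real trajectory under the same actions. The toy simulator, the parameter ranges, and the random-search loop are hypothetical; the paper's module is an automated version of this matching principle.

```python
import numpy as np

def rollout_sim(params, actions):
    """Placeholder simulator: a damped integrator whose friction and gain
    stand in for the physics parameters being tuned."""
    friction, gain = params
    x, v, traj = 0.0, 0.0, []
    for a in actions:
        v = (1.0 - friction) * v + gain * a
        x += 0.02 * v
        traj.append(x)
    return np.array(traj)

def tune_real_to_sim(real_traj, actions, n_samples=2000, seed=0):
    """Random-search version of real-to-sim tuning: keep the simulator
    parameters whose rollout best matches the logged real trajectory."""
    rng = np.random.default_rng(seed)
    best, best_err = None, np.inf
    for _ in range(n_samples):
        params = rng.uniform([0.0, 0.1], [0.5, 2.0])
        err = np.mean((rollout_sim(params, actions) - real_traj) ** 2)
        if err < best_err:
            best, best_err = params, err
    return best, best_err

# Synthetic "real" data generated from hidden parameters, then recovered.
actions = np.sin(np.linspace(0, 4 * np.pi, 100))
real_traj = rollout_sim((0.12, 0.8), actions)
params, err = tune_real_to_sim(real_traj, actions)
print("recovered params:", params.round(3), "mse:", err)
```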
ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills
He, Tairan, Gao, Jiawei, Xiao, Wenli, Zhang, Yuanhang, Wang, Zi, Wang, Jiashun, Luo, Zhengyi, He, Guanqi, Sobanbab, Nikhil, Pan, Chaoyi, Yi, Zeji, Qu, Guannan, Kitani, Kris, Hodgins, Jessica, Fan, Linxi "Jim", Zhu, Yuke, Liu, Changliu, Shi, Guanya
Humanoid robots hold the potential for unparalleled versatility in performing human-like, whole-body skills. However, achieving agile and coordinated whole-body motions remains a significant challenge due to the dynamics mismatch between simulation and the real world. Existing approaches, such as system identification (SysID) and domain randomization (DR), often rely on labor-intensive parameter tuning or result in overly conservative policies that sacrifice agility. In this paper, we present ASAP (Aligning Simulation and Real Physics), a two-stage framework designed to tackle the dynamics mismatch and enable agile humanoid whole-body skills. In the first stage, ASAP pre-trains whole-body motion policies in simulation; in the second stage, it learns a delta action model from real-world data that captures the dynamics mismatch. ASAP then fine-tunes the pre-trained policies with the delta action model integrated into the simulator to align effectively with real-world dynamics. We evaluate ASAP across three transfer scenarios--IsaacGym to IsaacSim, IsaacGym to Genesis, and IsaacGym to the real-world Unitree G1 humanoid robot. Our approach significantly improves agility and whole-body coordination across various dynamic motions, reducing tracking error compared to SysID, DR, and delta dynamics learning baselines. ASAP enables highly agile motions that were previously difficult to achieve, demonstrating the potential of delta action learning in bridging simulation and real-world dynamics. These results suggest a promising sim-to-real direction for developing more expressive and agile humanoids.
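A minimal sketch of the delta action idea: a residual model, fit on real-world data, corrects the policy's action inside the simulator so that fine-tuning happens under dynamics closer to the real robot. The network sizes, the `step_aligned_sim` helper, and the toy simulator step are assumptions, not the ASAP implementation.

```python
import torch
import torch.nn as nn

class DeltaActionModel(nn.Module):
    """Residual action term conditioned on state and the policy's action,
    trained (elsewhere) on real-world data to absorb the dynamics mismatch."""
    def __init__(self, obs_dim=48, act_dim=12, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def step_aligned_sim(sim_step, delta_model, obs, act):
    """Apply the policy action plus the frozen delta correction inside the
    simulator, so fine-tuning sees dynamics closer to the real robot."""
    with torch.no_grad():
        corrected = act + delta_model(obs, act)
    return sim_step(obs, corrected)

# Toy usage with a stand-in simulator step.
sim_step = lambda obs, act: obs + 0.01 * act.sum(dim=-1, keepdim=True).expand_as(obs)
delta = DeltaActionModel()
obs, act = torch.zeros(1, 48), torch.zeros(1, 12)
next_obs = step_aligned_sim(sim_step, delta, obs, act)
```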
Humanoid Locomotion and Manipulation: Current Progress and Challenges in Control, Planning, and Learning
Gu, Zhaoyuan, Li, Junheng, Shen, Wenlan, Yu, Wenhao, Xie, Zhaoming, McCrory, Stephen, Cheng, Xianyi, Shamsah, Abdulaziz, Griffin, Robert, Liu, C. Karen, Kheddar, Abderrahmane, Peng, Xue Bin, Zhu, Yuke, Shi, Guanya, Nguyen, Quan, Cheng, Gordon, Gao, Huijun, Zhao, Ye
Humanoid robots have great potential to perform various human-level skills. These skills involve locomotion, manipulation, and cognitive capabilities. Driven by advances in machine learning and the strength of existing model-based approaches, these capabilities have progressed rapidly, but often separately. Therefore, a timely overview of current progress and future trends in this fast-evolving field is essential. This survey first summarizes the model-based planning and control that have been the backbone of humanoid robotics for the past three decades. We then explore emerging learning-based methods, with a focus on reinforcement learning and imitation learning that enhance the versatility of loco-manipulation skills. We examine the potential of integrating foundation models with humanoid embodiments, assessing the prospects for developing generalist humanoid agents. In addition, this survey covers emerging research for whole-body tactile sensing that unlocks new humanoid skills that involve physical interactions. The survey concludes with a discussion of the challenges and future trends.
AMAGO-2: Breaking the Multi-Task Barrier in Meta-Reinforcement Learning with Transformers
Grigsby, Jake, Sasek, Justin, Parajuli, Samyak, Adebi, Daniel, Zhang, Amy, Zhu, Yuke
Language models trained on diverse datasets unlock generalization by in-context learning. Reinforcement Learning (RL) policies can achieve a similar effect by meta-learning within the memory of a sequence model. However, meta-RL research primarily focuses on adapting to minor variations of a single task. It is difficult to scale towards more general behavior without confronting challenges in multi-task optimization, and few solutions are compatible with meta-RL's goal of learning from large training sets of unlabeled tasks. To address this challenge, we revisit the idea that multi-task RL is bottlenecked by imbalanced training losses created by uneven return scales across different tasks. We build upon recent advancements in Transformer-based (in-context) meta-RL and evaluate a simple yet scalable solution where both an agent's actor and critic objectives are converted to classification terms that decouple optimization from the current scale of returns. Large-scale comparisons in Meta-World ML45, Multi-Game Procgen, Multi-Task POPGym, Multi-Game Atari, and BabyAI find that this design unlocks significant progress in online multi-task adaptation and memory problems without explicit task labels.
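The core trick is to replace scale-sensitive value regression with classification over a fixed grid of return bins, so tasks with very different return magnitudes contribute comparable gradients. Below is a minimal two-hot/cross-entropy sketch of that conversion; the bin support and helper names are illustrative choices, not the AMAGO-2 code.

```python
import torch
import torch.nn.functional as F

def two_hot(targets, bins):
    """Project scalar return targets onto a fixed grid of bins ("two-hot"):
    probability mass is split between the two nearest bins."""
    targets = targets.clamp(float(bins[0]), float(bins[-1]))
    idx_hi = torch.searchsorted(bins, targets, right=True).clamp(max=len(bins) - 1)
    idx_lo = (idx_hi - 1).clamp(min=0)
    lo, hi = bins[idx_lo], bins[idx_hi]
    w_hi = torch.where(hi > lo, (targets - lo) / (hi - lo), torch.ones_like(targets))
    dist = torch.zeros(targets.shape[0], len(bins))
    dist.scatter_(1, idx_lo.unsqueeze(1), (1 - w_hi).unsqueeze(1))
    dist.scatter_add_(1, idx_hi.unsqueeze(1), w_hi.unsqueeze(1))
    return dist

def classification_value_loss(logits, returns, bins):
    """Cross-entropy between predicted bin logits and the two-hot target,
    replacing a scale-sensitive regression loss."""
    return F.cross_entropy(logits, two_hot(returns, bins))

bins = torch.linspace(-100.0, 100.0, 65)           # fixed support shared by all tasks
logits = torch.randn(4, 65, requires_grad=True)    # critic head output
returns = torch.tensor([0.3, 42.0, -7.5, 90.0])    # very different return scales
loss = classification_value_loss(logits, returns, bins)
loss.backward()
```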
LEGATO: Cross-Embodiment Imitation Using a Grasping Tool
Seo, Mingyo, Park, H. Andy, Yuan, Shenli, Zhu, Yuke, Sentis, Luis
Cross-embodiment imitation learning enables policies trained on specific embodiments to transfer across different robots, unlocking the potential for large-scale imitation learning that is both cost-effective and highly reusable. This paper presents LEGATO, a cross-embodiment imitation learning framework for visuomotor skill transfer across varied kinematic morphologies. We introduce a handheld gripper that unifies action and observation spaces, allowing tasks to be defined consistently across robots. Using this gripper, we train visuomotor policies via imitation learning, applying a motion-invariant transformation to compute the training loss. Gripper motions are then retargeted into high-degree-of-freedom whole-body motions using inverse kinematics for deployment across diverse embodiments. Our evaluations in simulation and real-robot experiments highlight the framework's effectiveness in learning and transferring visuomotor skills across various robots. More information can be found at the project page: https://ut-hcrl.github.io/LEGATO.
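One way to read the motion-invariant idea: actions are expressed as relative SE(3) motions of the handheld gripper, so the same demonstration can be replayed from any embodiment's current gripper pose and handed to that robot's IK solver. The sketch below shows this relative-motion bookkeeping with hypothetical helpers; it is not the LEGATO codebase.

```python
import numpy as np

def make_pose(R, p):
    """Build a 4x4 homogeneous transform from rotation R and position p."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, p
    return T

def relative_motion(T_prev, T_next):
    """Express the gripper's motion as the transform from the previous gripper
    frame to the next one (invariant to which robot carries the gripper)."""
    return np.linalg.inv(T_prev) @ T_next

def apply_motion(T_current, T_delta):
    """Replay a relative motion from any embodiment's current gripper pose;
    an IK solver (not shown) would then map the target pose to joint commands."""
    return T_current @ T_delta

# Two poses recorded with the handheld gripper ...
T0 = make_pose(np.eye(3), [0.4, 0.0, 0.3])
T1 = make_pose(np.eye(3), [0.4, 0.0, 0.25])
delta = relative_motion(T0, T1)              # "move 5 cm down" in the gripper frame
# ... replayed from a different robot's current gripper pose.
T_robot = make_pose(np.eye(3), [0.6, 0.2, 0.5])
print(apply_motion(T_robot, delta)[:3, 3])   # -> [0.6, 0.2, 0.45]
```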
RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation
Nasiriany, Soroush, Kirmani, Sean, Ding, Tianli, Smith, Laura, Zhu, Yuke, Driess, Danny, Sadigh, Dorsa, Xiao, Ted
We explore how intermediate policy representations can facilitate generalization by providing guidance on how to perform manipulation tasks. Existing representations such as language, goal images, and trajectory sketches have been shown to be helpful, but these representations either do not provide enough context or provide over-specified context that yields less robust policies. We propose conditioning policies on affordances, which capture the pose of the robot at key stages of the task. Affordances offer expressive yet lightweight abstractions, are easy for users to specify, and facilitate efficient learning by transferring knowledge from large internet datasets. Our method, RT-Affordance, is a hierarchical model that first proposes an affordance plan given the task language, and then conditions the policy on this affordance plan to perform manipulation. Our model can flexibly bridge heterogeneous sources of supervision including large web datasets and robot trajectories. We additionally train our model on cheap-to-collect in-domain affordance images, allowing us to learn new tasks without collecting any additional costly robot trajectories. We show on a diverse set of novel tasks how RT-Affordance exceeds the performance of existing methods by over 50%, and we empirically demonstrate that affordances are robust to novel settings. Videos available at https://snasiriany.me/rt-affordance
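Structurally, the method is a two-stage pipeline: first propose an affordance plan (key end-effector poses with semantic stage labels) from the task language and image, then condition the low-level policy on that plan. The skeleton below mirrors that interface with stand-in functions and a simplified pose parameterization; names and fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Affordance:
    """Key end-effector pose at a semantically meaningful stage of the task
    (e.g. pre-grasp, grasp, place), reduced here to position + yaw."""
    stage: str
    xyz: Sequence[float]
    yaw: float

def propose_affordance_plan(language: str, image) -> List[Affordance]:
    """Stage 1 (stand-in): a learned model would map task language + image to
    a short affordance plan; here we return a fixed toy plan."""
    return [
        Affordance("pre-grasp", (0.5, 0.1, 0.2), 0.0),
        Affordance("grasp", (0.5, 0.1, 0.12), 0.0),
        Affordance("place", (0.3, -0.2, 0.15), 1.57),
    ]

def policy_step(observation, plan: List[Affordance]):
    """Stage 2 (stand-in): the low-level policy is conditioned on the plan
    rather than on language alone, and outputs the next action."""
    target = plan[0]
    return {"move_towards": target.xyz, "yaw": target.yaw}

plan = propose_affordance_plan("put the cup on the shelf", image=None)
print(policy_step(observation=None, plan=plan))
```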
SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation
Hsu, Cheng-Chun, Wen, Bowen, Xu, Jie, Narang, Yashraj, Wang, Xiaolong, Zhu, Yuke, Biswas, Joydeep, Birchfield, Stan
We introduce SPOT, an object-centric imitation learning framework. The key idea is to capture each task by an object-centric representation, specifically the SE(3) object pose trajectory relative to the target. This approach decouples embodiment actions from sensory inputs, facilitating learning from various demonstration types, including both action-based and action-less human hand demonstrations, as well as cross-embodiment generalization. Additionally, object pose trajectories inherently capture planning constraints from demonstrations without the need for manually crafted rules. To guide the robot in executing the task, the object trajectory is used to condition a diffusion policy. We show improvement compared to prior work on RLBench simulated tasks. In real-world evaluation, using only eight demonstrations shot on an iPhone, our approach completed all tasks while fully complying with task constraints. Project page: https://nvlabs.github.io/object_centric_diffusion
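The object-centric representation amounts to expressing the manipulated object's SE(3) pose in the target's frame and describing the task as a trajectory of such relative poses, independent of the robot's embodiment. A minimal sketch of that conversion, with assumed helper names and translation-only toy poses:

```python
import numpy as np

def make_pose(p):
    """4x4 homogeneous transform with identity rotation (toy poses)."""
    T = np.eye(4)
    T[:3, 3] = p
    return T

def relative_pose(T_obj, T_target):
    """Object pose expressed in the target's frame; the task is described by a
    trajectory of these relative poses, not by robot joint actions."""
    return np.linalg.inv(T_target) @ T_obj

def object_centric_trajectory(obj_poses, target_pose):
    """Convert a demonstration (e.g. tracked from a phone video) into the
    object-centric SE(3) trajectory that would condition the policy."""
    return [relative_pose(T, target_pose) for T in obj_poses]

demo = [make_pose([0.5, 0.0, 0.3]), make_pose([0.45, 0.0, 0.25]), make_pose([0.4, 0.0, 0.2])]
target = make_pose([0.4, 0.0, 0.2])
for T_rel in object_centric_trajectory(demo, target):
    print(np.round(T_rel[:3, 3], 3))   # approaches the identity as the task completes
```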
DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning
Jiang, Zhenyu, Xie, Yuqi, Lin, Kevin, Xu, Zhenjia, Wan, Weikang, Mandlekar, Ajay, Fan, Linxi, Zhu, Yuke
Imitation learning from human demonstrations is an effective means to teach robots manipulation skills. But data acquisition is a major bottleneck to applying this paradigm more broadly, due to the cost and human effort involved. There has been significant interest in imitation learning for bimanual dexterous robots, like humanoids. Unfortunately, data collection is even harder here because of the difficulty of simultaneously controlling multiple arms and multi-fingered hands. Automated data generation in simulation is a compelling, scalable alternative to fuel this need for data. To this end, we introduce DexMimicGen, a large-scale automated data generation system that synthesizes trajectories from a handful of human demonstrations for humanoid robots with dexterous hands. We present a collection of simulation environments in the setting of bimanual dexterous manipulation, spanning a range of manipulation behaviors and different requirements for coordination among the two arms. We generate 21K demos across these tasks from just 60 source human demos and study the effect of several data generation and policy learning decisions on agent performance. Finally, we present a real-to-sim-to-real pipeline and deploy it on a real-world humanoid can sorting task. Videos and more are at https://dexmimicgen.github.io/
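At the heart of this style of data generation is re-expressing object-relative demonstration segments for new object placements: keep the end-effector's pose relative to the object fixed and remap it into the new scene. The sketch below shows that transform for a single segment; the helper names and translation-only poses are simplifications, not the DexMimicGen system.

```python
import numpy as np

def make_pose(p):
    """4x4 homogeneous transform with identity rotation (toy poses)."""
    T = np.eye(4)
    T[:3, 3] = p
    return T

def adapt_segment(segment_eef_poses, T_obj_src, T_obj_new):
    """Re-express an object-relative demo segment for a new object pose:
    keep the end-effector's pose relative to the object fixed, then map it
    into the world frame of the new scene."""
    T_new_from_src = T_obj_new @ np.linalg.inv(T_obj_src)
    return [T_new_from_src @ T for T in segment_eef_poses]

# One grasp segment from a source demo, replayed for a shifted object.
src_obj = make_pose([0.5, 0.0, 0.1])
new_obj = make_pose([0.6, 0.2, 0.1])
segment = [make_pose([0.5, 0.0, 0.25]), make_pose([0.5, 0.0, 0.12])]
for T in adapt_segment(segment, src_obj, new_obj):
    print(np.round(T[:3, 3], 3))   # approach and grasp now happen above the new object pose
```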