Shi, Guanya
Humanoid Policy ~ Human Policy
Qiu, Ri-Zhao, Yang, Shiqi, Cheng, Xuxin, Chawla, Chaitanya, Li, Jialong, He, Tairan, Yan, Ge, Paulsen, Lars, Yang, Ge, Yi, Sha, Shi, Guanya, Wang, Xiaolong
Training manipulation policies for humanoid robots with diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive, requiring expensive tele-operated data collection that is difficult to scale. This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning. We mitigate the embodiment gap between humanoids and humans from both the data and the modeling perspectives. We collect an egocentric, task-oriented dataset (PH2D) that is directly aligned with humanoid manipulation demonstrations. We then train a human-humanoid behavior policy, which we term Human Action Transformer (HAT). The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision. We show that human data improves both the generalization and the robustness of HAT, with significantly better data collection efficiency. Code and data: https://human-as-robot.github.io/
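To make the unified state-action idea concrete, here is a minimal sketch, not the authors' implementation, of how a shared policy output could be differentiably retargeted to robot joint targets; the dimensions and the learned linear map are illustrative assumptions standing in for the paper's retargeting function.

import torch
import torch.nn as nn

# Hypothetical dimensions: a unified action covering wrist poses and hand
# keypoints; the robot command is a vector of joint position targets.
UNIFIED_DIM, ROBOT_DOF = 26, 14

class DifferentiableRetarget(nn.Module):
    """Maps unified human-space actions to robot joint targets.

    A learned linear map stands in for an analytic retargeting function;
    because it is differentiable, policy gradients flow through it.
    """
    def __init__(self):
        super().__init__()
        self.map = nn.Linear(UNIFIED_DIM, ROBOT_DOF)

    def forward(self, unified_action: torch.Tensor) -> torch.Tensor:
        return self.map(unified_action)

policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, UNIFIED_DIM))
retarget = DifferentiableRetarget()

obs = torch.randn(8, 64)                  # batched egocentric features
unified = policy(obs)                     # same space for human and robot data
robot_targets = retarget(unified)         # applied only for the robot embodiment
loss = ((robot_targets - torch.randn(8, ROBOT_DOF)) ** 2).mean()
loss.backward()                           # gradients reach the shared policy

Human demonstrations supervise the policy directly in the unified space, while robot demonstrations supervise it through the retargeting layer, so both embodiments train the same network.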
Whole-Body Model-Predictive Control of Legged Robots with MuJoCo
Zhang, John Z., Howell, Taylor A., Yi, Zeji, Pan, Chaoyi, Shi, Guanya, Qu, Guannan, Erez, Tom, Tassa, Yuval, Manchester, Zachary
We demonstrate the surprising real-world effectiveness of a very simple approach to whole-body model-predictive control (MPC) of quadruped and humanoid robots: the iterative LQR (iLQR) algorithm with MuJoCo dynamics and finite-difference-approximated derivatives. Building upon the previous success of model-based behavior synthesis and control of locomotion and manipulation tasks with MuJoCo in simulation, we show that these policies can easily generalize to the real world with few sim-to-real considerations. Our baseline method achieves real-time whole-body MPC on a variety of hardware experiments, including dynamic quadruped locomotion, quadruped walking on two legs, and full-sized humanoid bipedal locomotion. We hope this easy-to-reproduce hardware baseline lowers the barrier to entry for real-world whole-body MPC research and helps accelerate research in the community. Our code and experiment videos will be available online at: https://johnzhang3.github.io/mujoco_ilqr
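To give a concrete flavor of the approach, here is a minimal sketch (not the authors' released code) of the finite-difference linearization step that feeds the iLQR backward pass, using MuJoCo's built-in mjd_transitionFD; the toy single-hinge model below is a placeholder for the full quadruped and humanoid models.

import mujoco
import numpy as np

# Toy stand-in model; the paper uses full quadruped/humanoid models.
XML = """
<mujoco>
  <worldbody>
    <body>
      <joint name="hinge" type="hinge"/>
      <geom size="0.1" mass="1"/>
    </body>
  </worldbody>
  <actuator><motor joint="hinge"/></actuator>
</mujoco>
"""
model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

nx = 2 * model.nv + model.na   # MuJoCo's transition-state dimension
nu = model.nu

def linearize(data, eps=1e-6):
    """Finite-difference Jacobians A = df/dx, B = df/du at the current state."""
    A = np.zeros((nx, nx))
    B = np.zeros((nx, nu))
    mujoco.mjd_transitionFD(model, data, eps, True, A, B, None, None)
    return A, B

A, B = linearize(data)
# A backward Riccati pass over per-step (A_t, B_t) and quadratic cost
# expansions then yields the iLQR gains K_t and feedforward terms k_t.

The appeal of this recipe is that no analytic derivatives are needed: the simulator itself supplies the linearization at every step of the horizon.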
ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills
He, Tairan, Gao, Jiawei, Xiao, Wenli, Zhang, Yuanhang, Wang, Zi, Wang, Jiashun, Luo, Zhengyi, He, Guanqi, Sobanbab, Nikhil, Pan, Chaoyi, Yi, Zeji, Qu, Guannan, Kitani, Kris, Hodgins, Jessica, Fan, Linxi "Jim", Zhu, Yuke, Liu, Changliu, Shi, Guanya
Figure: The humanoid robot (Unitree G1) demonstrates diverse agile whole-body skills, showcasing the control policies' agility: (a) Cristiano Ronaldo's signature celebration, involving a jump with a 180-degree mid-air rotation; (b) LeBron James's "Silencer" celebration, involving single-leg balancing; (c) Kobe Bryant's famous fadeaway jump shot, involving single-leg jumping and landing; (d) 1.5 m forward jumping; (e) leg stretching; (f) 1.3 m side jumping.

Abstract: Humanoid robots hold the potential for unparalleled versatility in performing human-like, whole-body skills. However, achieving agile and coordinated whole-body motions remains a significant challenge due to the dynamics mismatch between simulation and the real world. Existing approaches, such as system identification (SysID) and domain randomization (DR), often rely on labor-intensive parameter tuning or result in overly conservative policies that sacrifice agility. In this paper, we present ASAP (Aligning Simulation and Real Physics), a two-stage framework designed to tackle the dynamics mismatch and enable agile humanoid whole-body skills. In the first stage, motion tracking policies are pre-trained in simulation and then deployed in the real world, where the collected rollouts are used to train a delta (residual) action model that compensates for the dynamics mismatch. ASAP then fine-tunes the pre-trained policies with the delta action model integrated into the simulator to align them effectively with real-world dynamics. We evaluate ASAP across three transfer scenarios: IsaacGym to IsaacSim, IsaacGym to Genesis, and IsaacGym to the real-world Unitree G1 humanoid robot. Our approach significantly improves agility and whole-body coordination across various dynamic motions, reducing tracking error compared to SysID, DR, and delta dynamics learning baselines. ASAP enables highly agile motions that were previously difficult to achieve, demonstrating the potential of delta action learning in bridging simulation and real-world dynamics. These results suggest a promising sim-to-real direction for developing more expressive and agile humanoids.

Introduction (excerpt): For decades, we have envisioned humanoid robots achieving or even surpassing human-level agility. However, most prior work [46, 74, 47, 73, 107, 19, 95, 50] has primarily focused on locomotion, treating the legs as a means of mobility. Recent studies [10, 25, 24, 26, 32] have introduced whole-body expressiveness in humanoid robots, but these efforts have primarily focused on upper-body motions and have yet to achieve the agility seen in human movement.
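A minimal sketch of the delta-action idea follows; it is illustrative torch code with assumed dimensions and an assumed simulator handle, not the paper's implementation. A residual network fit on real-world rollouts corrects simulator actions, so fine-tuning happens against sim-plus-delta dynamics.

import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 48, 23   # assumed dimensions for illustration

delta = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 256), nn.ReLU(),
                      nn.Linear(256, ACT_DIM))
opt = torch.optim.Adam(delta.parameters(), lr=3e-4)

def fit_delta(real_batches):
    """Stage 1: regress the action correction that makes the simulator
    reproduce real-world transitions (targets assumed precomputed here)."""
    for obs, act, target_delta in real_batches:
        pred = delta(torch.cat([obs, act], dim=-1))
        loss = ((pred - target_delta) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

def sim_step_with_delta(sim, obs, act):
    """Stage 2: the frozen delta model is injected into the simulator so
    the policy is fine-tuned against sim-plus-delta dynamics."""
    with torch.no_grad():
        corrected = act + delta(torch.cat([obs, act], dim=-1))
    return sim.step(corrected)

# Toy usage of the fitting step with a fake batch:
fit_delta([(torch.randn(32, OBS_DIM), torch.randn(32, ACT_DIM),
            torch.zeros(32, ACT_DIM))])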
TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint
Lin, Haotian, Wang, Pengcheng, Schneider, Jeff, Shi, Guanya
The value learning scheme in the TD-MPC implementation leads to persistent value overestimation. It is also empirically observed that the performance of TD-MPC2 is far from satisfactory on some high-dimensional locomotion tasks [33]. This phenomenon is closely connected to, yet distinct from, the well-known overestimation bias arising from function approximation errors and error accumulation in temporal-difference learning [39, 37, 7]. More precisely, we identify the underlying issue as policy mismatch: the behavior policy generated by the MPC planner governs data collection, creating a buffered data distribution that does not directly align with the learned value or policy prior. Through theoretical analysis and experiments, we argue that this issue is deeply rooted in the structural mismatch between the data-generation policy, which is always bootstrapped by the planner, and the learned policy prior. To mitigate such a mismatch in a minimalist way, we propose a policy regularization term that reduces out-of-distribution (OOD) queries, thereby improving value learning. Our method involves minimal changes on top of existing frameworks and requires no additional computation. Extensive experiments demonstrate that the proposed approach improves performance over baselines such as TD-MPC2 by large margins, particularly on 61-DoF humanoid tasks.
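The regularizer admits a compact sketch, shown below as illustrative torch code rather than the released implementation: alongside the usual value-maximizing prior objective, the policy prior is pulled toward the planner's buffered actions, which keeps subsequent value queries closer to the data distribution.

import torch

def policy_loss(policy, q_fn, obs, planner_actions, alpha=1.0):
    """Standard prior objective plus a behavior-cloning-style constraint
    toward the MPC planner's actions (the behavior policy)."""
    actions = policy(obs)
    q_term = -q_fn(obs, actions).mean()                 # maximize value
    reg = ((actions - planner_actions) ** 2).mean()     # stay near buffered data
    return q_term + alpha * reg

# Toy usage with simple stand-ins for the policy and Q-function:
policy = torch.nn.Linear(10, 3)
q_fn = lambda s, a: -(a ** 2).sum(-1, keepdim=True)
obs, acts = torch.randn(32, 10), torch.randn(32, 3)
policy_loss(policy, q_fn, obs, acts).backward()

The weight alpha is an assumed knob trading off value improvement against staying inside the planner-generated data distribution.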
Bridging Adaptivity and Safety: Learning Agile Collision-Free Locomotion Across Varied Physics
Zhong, Yichao, Zhang, Chong, He, Tairan, Shi, Guanya
Real-world legged locomotion systems often need to reconcile agility and safety in different scenarios. Moreover, the underlying dynamics are often unknown and time-variant (e.g., payload, friction). In this paper, we introduce BAS (Bridging Adaptivity and Safety), which builds upon the pipeline of the prior work Agile But Safe (ABS) (He et al., 2024b) and is designed to provide adaptive safety even in dynamic environments with uncertainties. BAS involves an agile policy that avoids obstacles rapidly, a recovery policy that prevents collisions, a physical parameter estimator trained concurrently with the agile policy, and a learned control-theoretic reach-avoid (RA) value network that governs the policy switch. The agile policy and the RA network are both conditioned on the physical parameters to make them adaptive. To mitigate the distribution-shift issue, we further introduce an on-policy fine-tuning phase for the estimator to enhance its robustness and accuracy. Simulation results show that BAS achieves 50% better safety than baselines in dynamic environments while maintaining a higher average speed. In real-world experiments, BAS demonstrates its capability in complex environments with unknown physics (e.g., slippery floors with unknown friction, unknown payloads up to 8 kg), whereas baselines lack adaptivity, leading to collisions or degraded agility. As a result, BAS achieves a 19.8% increase in speed and a 2.36 times lower collision rate than ABS in the real world.
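The policy-switch logic can be sketched in a few lines; this is an illustrative rendering with assumed network handles and an assumed sign convention in which negative RA values certify safety.

import torch

def select_action(obs, phys_params, agile_pi, recovery_pi, ra_value,
                  threshold=0.0):
    """Run the agile policy while the reach-avoid value certifies safety;
    otherwise fall back to the recovery policy. Both the agile policy and
    the RA network are conditioned on the estimated physical parameters."""
    x = torch.cat([obs, phys_params], dim=-1)
    if ra_value(x).item() < threshold:   # assumed: negative = inside safe set
        return agile_pi(x)
    return recovery_pi(obs)

# Toy usage with linear stand-ins (obs dim 12, 2 physical parameters):
agile_pi, recovery_pi = torch.nn.Linear(14, 4), torch.nn.Linear(12, 4)
ra_value = torch.nn.Linear(14, 1)
a = select_action(torch.randn(12), torch.randn(2),
                  agile_pi, recovery_pi, ra_value)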
Humanoid Locomotion and Manipulation: Current Progress and Challenges in Control, Planning, and Learning
Gu, Zhaoyuan, Li, Junheng, Shen, Wenlan, Yu, Wenhao, Xie, Zhaoming, McCrory, Stephen, Cheng, Xianyi, Shamsah, Abdulaziz, Griffin, Robert, Liu, C. Karen, Kheddar, Abderrahmane, Peng, Xue Bin, Zhu, Yuke, Shi, Guanya, Nguyen, Quan, Cheng, Gordon, Gao, Huijun, Zhao, Ye
Humanoid robots have great potential to perform various human-level skills. These skills involve locomotion, manipulation, and cognitive capabilities. Driven by advances in machine learning and the strength of existing model-based approaches, these capabilities have progressed rapidly, but often separately. Therefore, a timely overview of current progress and future trends in this fast-evolving field is essential. This survey first summarizes the model-based planning and control methods that have been the backbone of humanoid robotics for the past three decades. We then explore emerging learning-based methods, with a focus on reinforcement learning and imitation learning, which enhance the versatility of loco-manipulation skills. We examine the potential of integrating foundation models with humanoid embodiments, assessing the prospects for developing generalist humanoid agents. In addition, this survey covers emerging research on whole-body tactile sensing, which unlocks new humanoid skills involving physical interaction. The survey concludes with a discussion of the challenges and future trends.
Q-learning-based Model-free Safety Filter
Sue, Guo Ning, Choudhary, Yogita, Desatnik, Richard, Majidi, Carmel, Dolan, John, Shi, Guanya
Ensuring safety via safety filters in real-world robotics presents significant challenges, particularly when the system dynamics are complex or unavailable. To handle this issue, learning-based safety filters have recently gained popularity and can be classified into model-based and model-free methods. Existing model-based approaches require various assumptions on the system model (e.g., control-affine dynamics), which limits their application to complex systems, while existing model-free approaches need substantial modifications to standard RL algorithms and lack versatility. This paper proposes a simple, plug-and-play, and effective model-free safety filter learning framework. We introduce a novel reward formulation and use Q-learning to learn Q-value functions that safeguard arbitrary task-specific nominal policies by filtering out their potentially unsafe actions. The threshold used in the filtering process is supported by our theoretical analysis. Due to its model-free nature and simplicity, our framework can be seamlessly integrated with various RL algorithms. We validate the proposed approach through simulations on double-integrator and Dubins car systems and demonstrate its effectiveness in real-world experiments with a soft robotic limb.
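As a sketch of the filtering rule, here is an illustrative discrete-action version with assumed names: the nominal action passes through unless its safety Q-value falls below the threshold, in which case the safest available action is substituted.

import numpy as np

def safety_filter(q_safe, state, nominal_action, threshold):
    """q_safe[state] holds safety Q-values over discrete actions. Keep the
    task policy's action when it is certified safe; otherwise override
    with the action of highest safety value."""
    q = q_safe[state]
    if q[nominal_action] >= threshold:
        return nominal_action
    return int(np.argmax(q))

# Toy usage: 2 states x 3 actions; nominal action 1 in state 0 is unsafe,
# so the filter overrides it with the safest action 0.
q_table = np.array([[0.9, 0.2, 0.5],
                    [0.1, 0.8, 0.4]])
assert safety_filter(q_table, 0, 1, threshold=0.3) == 0

Because the filter only reads the learned safety Q-function, it wraps around any nominal policy without modifying the underlying RL algorithm.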
Self-Supervised Meta-Learning for All-Layer DNN-Based Adaptive Control with Stability Guarantees
He, Guanqi, Choudhary, Yogita, Shi, Guanya
A critical goal of adaptive control is enabling robots to adapt rapidly in dynamic environments. Recent studies have developed meta-learning-based adaptive control schemes that use meta-learning to extract nonlinear features (represented by Deep Neural Networks, DNNs) from offline data and adaptive control to update linear coefficients online. However, such schemes are fundamentally limited by the linear parameterization of uncertainties and do not fully unleash the capability of DNNs. This paper introduces a novel learning-based adaptive control framework that pretrains a DNN via self-supervised meta-learning (SSML) from offline trajectories and adapts the full DNN online via composite adaptation. In particular, the offline SSML stage leverages the time consistency in trajectory data to train the DNN to predict future disturbances from history, in a self-supervised manner without environment condition labels. The online stage carefully designs a control law and an adaptation law to update the full DNN with stability guarantees. Empirically, the proposed framework significantly outperforms various classic and learning-based adaptive control baselines by 19-39% in challenging real-world quadrotor tracking problems under large, dynamic wind disturbances.
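The offline stage's self-supervised signal can be sketched as follows; this is illustrative torch code with assumed window sizes, not the paper's implementation. The DNN is trained to predict the disturbance realized right after a history window, exploiting time consistency instead of environment labels.

import torch
import torch.nn as nn

HIST, STATE_DIM, DIST_DIM = 20, 12, 3   # assumed sizes

net = nn.Sequential(nn.Linear(HIST * STATE_DIM, 256), nn.ReLU(),
                    nn.Linear(256, DIST_DIM))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def ssml_step(history, future_disturbance):
    """history: (B, HIST, STATE_DIM) of past states/inputs. The target is
    computed from the trajectory itself (e.g., the residual between
    measured and nominal dynamics), so no wind-condition labels are needed."""
    pred = net(history.flatten(1))
    loss = ((pred - future_disturbance) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

ssml_step(torch.randn(64, HIST, STATE_DIM), torch.randn(64, DIST_DIM))

Online, the same network's weights would then be updated by the adaptation law rather than by this offline optimizer.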
HOVER: Versatile Neural Whole-Body Controller for Humanoid Robots
He, Tairan, Xiao, Wenli, Lin, Toru, Luo, Zhengyi, Xu, Zhenjia, Jiang, Zhenyu, Kautz, Jan, Liu, Changliu, Shi, Guanya, Wang, Xiaolong, Fan, Linxi, Zhu, Yuke
Humanoid whole-body control requires adapting to diverse tasks such as navigation, loco-manipulation, and tabletop manipulation, each demanding a different mode of control. For example, navigation relies on root velocity tracking, while tabletop manipulation prioritizes upper-body joint angle tracking. Existing approaches typically train individual policies tailored to a specific command space, limiting their transferability across modes. We present the key insight that full-body kinematic motion imitation can serve as a common abstraction for all these tasks and provide general-purpose motor skills for learning multiple modes of whole-body control. Building on this, we propose HOVER (Humanoid Versatile Controller), a multi-mode policy distillation framework that consolidates diverse control modes into a unified policy. HOVER enables seamless transitions between control modes while preserving the distinct advantages of each, offering a robust and scalable solution for humanoid control across a wide range of modes. By eliminating the need for policy retraining for each control mode, our approach improves efficiency and flexibility for future humanoid applications.
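A condensed sketch of multi-mode distillation follows; the mask layout, dimensions, and teacher interface are illustrative assumptions. Each control mode is rendered as a binary mask over a unified command vector, and the student is regressed onto a full-body motion-imitation teacher under randomly sampled masks.

import torch
import torch.nn as nn

CMD_DIM, OBS_DIM, ACT_DIM = 30, 60, 19   # assumed dimensions

student = nn.Sequential(nn.Linear(OBS_DIM + 2 * CMD_DIM, 256), nn.ReLU(),
                        nn.Linear(256, ACT_DIM))
opt = torch.optim.Adam(student.parameters(), lr=3e-4)

def distill_step(obs, cmd, teacher_action):
    """Sample a control mode as a command mask; the student sees the masked
    command plus the mask itself and imitates the full-body teacher."""
    mask = (torch.rand_like(cmd) < 0.5).float()   # stand-in for mode sampling
    inp = torch.cat([obs, cmd * mask, mask], dim=-1)
    loss = ((student(inp) - teacher_action) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

distill_step(torch.randn(32, OBS_DIM), torch.randn(32, CMD_DIM),
             torch.randn(32, ACT_DIM))

Feeding the mask to the student lets one network serve every command space, which is what removes the need to retrain a policy per control mode.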
Agile Mobility with Rapid Online Adaptation via Meta-learning and Uncertainty-aware MPPI
Kalaria, Dvij, Xue, Haoru, Xiao, Wenli, Tao, Tony, Shi, Guanya, Dolan, John M.
Modern nonlinear model-based controllers require an accurate physics model and model parameters to control mobile robots at their limits. Moreover, due to surface slipping at high speeds, friction parameters may continually change (like tire degradation in autonomous racing), so the controller may need to adapt rapidly. Many works derive a task-specific robot model with a parameter adaptation scheme that works well for that task but requires substantial effort and tuning for each platform and task. In this work, we design a fully model-learning-based controller based on meta pre-training that can adapt very quickly to any wheeled robot with any model parameters using few-shot dynamics data, while also reasoning about model uncertainty. We demonstrate our results in small-scale numerical simulation, the large-scale Unity simulator, and on a medium-scale hardware platform across a wide range of settings. We show that our results are comparable to those of domain-specific, well-engineered controllers, with excellent generalization performance across all scenarios.
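The uncertainty-aware rollout cost can be sketched like this; it is an illustrative numpy MPPI fragment in which the ensemble, cost function, and penalty weight are assumptions. Disagreement across a meta-learned dynamics ensemble is added to the path cost so MPPI avoids plans the model is unsure about.

import numpy as np

def mppi_costs(ensemble, state, action_seqs, cost_fn, beta=1.0):
    """action_seqs: (K, H, du). Roll out each candidate through every
    ensemble member; accumulate mean task cost plus a penalty on the
    members' prediction disagreement."""
    K, H, _ = action_seqs.shape
    total = np.zeros(K)
    states = np.tile(state, (len(ensemble), K, 1))      # (E, K, dx)
    for t in range(H):
        nxt = np.stack([f(states[i], action_seqs[:, t])
                        for i, f in enumerate(ensemble)])
        total += cost_fn(nxt.mean(0), action_seqs[:, t])
        total += beta * nxt.std(0).sum(-1)              # uncertainty penalty
        states = nxt
    return total   # feed into the usual MPPI softmax weighting

# Toy usage: two linear "models" with slightly different parameters
ensemble = [lambda s, a, c=c: s + 0.1 * a * c for c in (0.9, 1.1)]
costs = mppi_costs(ensemble, np.zeros(2),
                   np.random.randn(16, 5, 2),
                   lambda s, a: (s ** 2).sum(-1))
print(costs.shape)   # (16,)

With few-shot adaptation, the ensemble (or its context) would be updated from recent dynamics data before each planning cycle.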