HDMI: Learning Interactive Humanoid Whole-Body Control from Human Videos

Weng, Haoyang; Li, Yitang; Sobanbabu, Nikhil; Wang, Zihan; Luo, Zhengyi; He, Tairan; Ramanan, Deva; Shi, Guanya

Sep-30-2025, arXiv.org Artificial Intelligence

Figure 1: HDMI enables humanoid robots to acquire diverse whole-body interaction skills directly from human videos.

The authors are with the Robotics Institute, Carnegie Mellon University, USA.

Abstract: Enabling robust whole-body humanoid-object interaction (HOI) remains challenging due to motion data scarcity and the contact-rich nature. We present HDMI (HumanoiD iMitation for Interaction), a simple and general framework that learns whole-body humanoid-object interaction skills directly from monocular RGB videos. Our pipeline (i) extracts and retargets human and object trajectories from unconstrained videos to build structured motion datasets, (ii) trains a rein- [...] Extensive sim-to-real experiments on a Unitree G1 humanoid demonstrate the robustness and generality of our approach: HDMI achieves 67 consecutive door traversals and successfully performs 6 distinct loco-manipulation tasks in the real world and 14 tasks in simulation. Our results establish HDMI as a simple and general framework for acquiring interactive humanoid skills from human videos.

Humanoid robots hold immense potential for assisting humans in diverse environments due to their human-like morphology and versatility.

Country:
- Asia (0.04)
- North America > United States
  - Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre:
- Research Report > New Finding (0.34)

Technology:
- Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.55)