InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions

Sirui Xu, Hung Yu Ling, Yu-Xiong Wang, Liang-Yan Gui

Feb-27-2025, arXiv.org Artificial Intelligence

Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy: perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.
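
The teacher-to-student distillation step described above, with teachers acting as online experts, can be illustrated with a toy sketch. This is not the InterMimic implementation: the abstract gives no losses, network architectures, or simulator details, so everything here is an assumption made for illustration. The 1-D dynamics, the linear "teachers" indexed by a hypothetical body `scale`, and the names `make_teacher`, `student_action`, and `distill` are all invented. The RL fine-tuning stage is omitted. The only idea shown is that the student acts in the environment while the matching teacher labels the states the student actually visits.

```python
import random

def make_teacher(scale):
    """Hypothetical subject-specific teacher: a fixed linear controller
    whose gain depends on a made-up body-scale parameter."""
    def teacher(state):
        return -(0.5 + 0.5 * scale) * state
    return teacher

def features(state, scale):
    # The single student sees the state and the subject's scale.
    return (state, scale * state)

def student_action(weights, state, scale):
    f = features(state, scale)
    return weights[0] * f[0] + weights[1] * f[1]

def distill(scales, episodes=3000, horizon=10, lr=0.05, seed=0):
    """Online distillation on toy 1-D dynamics (an assumption, not the
    paper's setup): roll out the student, query the teacher on each
    visited state, and regress the student onto the teacher's action."""
    rng = random.Random(seed)
    teachers = {s: make_teacher(s) for s in scales}
    weights = [0.0, 0.0]
    for _ in range(episodes):
        scale = rng.choice(scales)          # pick one subject per episode
        state = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            action = student_action(weights, state, scale)  # student acts
            target = teachers[scale](state)                 # teacher labels
            error = action - target
            f = features(state, scale)
            # One SGD step on the squared imitation error.
            weights[0] -= lr * error * f[0]
            weights[1] -= lr * error * f[1]
            # Toy dynamics driven by the student's own action.
            state = state + action + rng.gauss(0.0, 0.05)
    return weights

if __name__ == "__main__":
    scales = [0.5, 1.0, 1.5]
    w = distill(scales)
    for s in scales:
        print(s, student_action(w, 1.0, s), make_teacher(s)(1.0))
```

In this toy setting one student ends up reproducing all three teachers because the teachers' behavior is exactly representable in the student's features; the abstract makes no such claim about the real system, where the RL fine-tuning stage is described as the means to go beyond replicating demonstrations.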

large language model, machine learning, natural language

Country:
- Asia (0.28)
- North America > United States
  - Illinois (0.14)

Genre:
- Research Report (0.63)

Industry:
- Education (0.93)
- Leisure & Entertainment > Sports (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (0.67)
  - Natural Language > Large Language Model (0.66)
  - Representation & Reasoning
    - Model-Based Reasoning (0.61)
    - Optimization (0.46)
  - Robots (1.00)
  - Vision > Video Understanding (0.68)