UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations

Hanjung Kim, Jaehyun Kang, Hyolim Kang, Meedeum Cho, Seon Joo Kim, Youngwoon Lee

arXiv.org Artificial Intelligence 

Learning from human videos has emerged as a central paradigm in robot learning, offering a scalable way to address the scarcity of robot-specific data by leveraging large, diverse video sources. Human videos capture everyday behaviors such as human-object interactions, which could provide a rich source of skills for robot learning. Here, a central question arises: Can robots acquire cross-embodiment skill representations by watching large-scale human demonstrations? Translating human videos into robot-executable skill representations has traditionally relied on paired human-robot datasets [1, 2, 3] or predefined semantic skill labels [4, 5], both of which are difficult to scale. Recent approaches aim to bypass these requirements by learning cross-embodiment skill representations without explicit pairing or labeling [6, 7, 8, 9, 10]. However, these methods still impose constraints on data collection, such as multi-view camera setups and task-and-scene alignment between human and robot demonstrations, which limit their scalability and applicability to real-world, in-the-wild human videos. To this end, we propose Universal Skill representations (UniSkill), a scalable approach for learning cross-embodiment skill representations from large-scale in-the-wild video data, so that a robot can translate an unseen human demonstration into a sequence of robot-executable skill representations, as illustrated in Figure 1.
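To make the pipeline concrete, the sketch below (PyTorch) illustrates the general idea of translating a human demonstration into a sequence of embodiment-agnostic skill embeddings that condition a robot policy. All module names, network sizes, and the frame-pair formulation are illustrative assumptions for exposition only, not the authors' actual UniSkill architecture.

```python
# Hypothetical sketch of the cross-embodiment skill idea: a skill encoder maps
# a pair of video frames (current, future) to a skill embedding, and a robot
# policy conditions on that embedding. Shapes and modules are assumptions.
import torch
import torch.nn as nn


class SkillEncoder(nn.Module):
    """Maps a (current frame, future frame) pair to a skill embedding."""

    def __init__(self, skill_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(  # tiny CNN stand-in for a visual encoder
            nn.Conv2d(6, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, skill_dim),
        )

    def forward(self, frame_t: torch.Tensor, frame_tk: torch.Tensor) -> torch.Tensor:
        # Stack the two RGB frames along the channel axis: (B, 6, H, W).
        return self.backbone(torch.cat([frame_t, frame_tk], dim=1))


class SkillConditionedPolicy(nn.Module):
    """Predicts a robot action from the robot's observation and a skill embedding."""

    def __init__(self, obs_dim: int = 64, skill_dim: int = 128, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + skill_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs: torch.Tensor, skill: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, skill], dim=-1))


if __name__ == "__main__":
    encoder, policy = SkillEncoder(), SkillConditionedPolicy()
    # A human demonstration becomes a sequence of skill embeddings (one per
    # frame pair); the robot then executes them in order with its own observations.
    human_video = torch.randn(10, 3, 64, 64)   # 10 dummy frames
    robot_obs = torch.randn(1, 64)             # dummy robot observation
    skills = [encoder(human_video[t:t + 1], human_video[t + 1:t + 2]) for t in range(9)]
    actions = [policy(robot_obs, z) for z in skills]
    print(len(actions), actions[0].shape)      # 9 actions, each of dimension 7
```

The key design point this sketch conveys is the shared interface: because the skill embedding is extracted purely from video frames, the same encoder can in principle be applied to human and robot videos alike, and only the policy needs robot-specific data.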