UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations
Hanjung Kim, Jaehyun Kang, Hyolim Kang, Meedeum Cho, Seon Joo Kim, Youngwoon Lee
–arXiv.org Artificial Intelligence
Learning from human videos has emerged as a central paradigm in robot learning, offering a scalable way to address the scarcity of robot-specific data by leveraging large, diverse video sources. Human videos capture everyday behaviors such as human-object interactions, which could provide a rich source of skills for robot learning. This raises a central question: can robots acquire cross-embodiment skill representations by watching large-scale human demonstrations? Translating human videos into robot-executable skill representations has traditionally relied on paired human-robot datasets [1, 2, 3] or predefined semantic skill labels [4, 5], both of which are difficult to scale. Recent approaches aim to bypass these requirements by learning cross-embodiment skill representations without explicit pairing or labeling [6, 7, 8, 9, 10]. However, these methods still impose constraints on data collection, such as multi-view camera setups and task and scene alignment between human and robot demonstrations, which limit their scalability and their applicability to real-world, in-the-wild human videos. To this end, we propose Universal Skill representations (UniSkill), a scalable approach for learning cross-embodiment skill representations from large-scale in-the-wild video data, enabling a robot to translate an unseen human demonstration into a sequence of robot-executable skill representations, as illustrated in Figure 1.
Sep-23-2025