Motion data



Unveiling the Impact of Data and Model Scaling on High-Level Control for Humanoid Robots

Wei, Yuxi, Wang, Zirui, Yin, Kangning, Hu, Yue, Wang, Jingbo, Chen, Siheng

arXiv.org Artificial Intelligence

Abstract-- Data scaling has long been a critical bottleneck in robot learning. For humanoid robots, human videos and motion data are abundant and widely available, offering a free and large-scale data source. Moreover, the semantics associated with these motions enable modality alignment and high-level robot control learning. However, how to effectively mine raw video, extract robot-learnable representations, and leverage them for scalable learning remains an open problem. To address this, we introduce Humanoid-Union, a large-scale dataset generated through an autonomous pipeline, comprising over 260 hours of diverse, high-quality humanoid robot motion data with semantic annotations derived from human motion videos. The dataset can be further expanded via the same pipeline. Building on this data resource, we propose SCHUR, a scalable learning framework designed to explore the impact of large-scale data on high-level control in humanoid robots. Experimental results demonstrate that SCHUR achieves high motion generation quality and strong text-motion alignment under data and model scaling, with a 37% reconstruction improvement in MPJPE and a 25% alignment improvement in FID compared with previous methods. Its effectiveness is further validated through deployment on a real-world humanoid robot.
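The abstract reports reconstruction quality under MPJPE (Mean Per Joint Position Error). As a point of reference, here is a minimal sketch of that standard metric in Python; the array shapes and units are assumptions, not details from the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: mean Euclidean distance between
    predicted and ground-truth joint positions.

    pred, gt: float arrays of shape (frames, joints, 3), in metres (assumed).
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()
```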


Multi-Domain Motion Embedding: Expressive Real-Time Mimicry for Legged Robots

Heyrman, Matthias, Li, Chenhao, Klemm, Victor, Kang, Dongho, Coros, Stelian, Hutter, Marco

arXiv.org Artificial Intelligence

Effective motion representation is crucial for enabling robots to imitate expressive behaviors in real time, yet existing motion controllers often ignore inherent patterns in motion. Previous efforts in representation learning have not attempted to jointly capture the structured periodic patterns and irregular variations in human and animal movement. To address this, we present Multi-Domain Motion Embedding (MDME), a motion representation that unifies the embedding of structured and unstructured features using a wavelet-based encoder and a probabilistic embedding in parallel. This produces a rich representation of reference motions from a minimal input set, enabling improved generalization across diverse motion styles and morphologies. We evaluate MDME on retargeting-free real-time motion imitation by conditioning robot control policies on the learned embeddings, demonstrating accurate reproduction of complex trajectories on both humanoid and quadruped platforms. Our comparative studies confirm that MDME outperforms prior approaches in reconstruction fidelity and in generalizability to unseen motions. Furthermore, we demonstrate that MDME can reproduce novel motion styles in real time through zero-shot deployment, eliminating the need for task-specific tuning or online retargeting. These results position MDME as a generalizable and structure-aware foundation for scalable real-time robot imitation.
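As an illustration of the two-branch idea described above (a structured encoder and a probabilistic embedding in parallel), here is a hedged PyTorch sketch. All layer sizes and names are assumptions, and the learned frequency branch merely stands in for the paper's wavelet-based encoder; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class TwoBranchMotionEmbedding(nn.Module):
    """Illustrative two-branch embedding (hypothetical sizes): a
    frequency-style branch for structured periodic patterns and a
    VAE-style probabilistic branch for irregular variation, whose
    outputs are concatenated into a single motion code."""
    def __init__(self, in_dim=64, freq_dim=32, latent_dim=32):
        super().__init__()
        # stand-in for the paper's wavelet-based encoder of structured features
        self.freq_enc = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, freq_dim))
        # probabilistic branch for unstructured variation
        self.mu = nn.Linear(in_dim, latent_dim)
        self.logvar = nn.Linear(in_dim, latent_dim)

    def forward(self, x):
        # x: (batch, in_dim) features of a reference-motion window
        z_freq = self.freq_enc(x)
        mu, logvar = self.mu(x), self.logvar(x)
        z_prob = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return torch.cat([z_freq, z_prob], dim=-1)  # conditioning code for a policy
```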



VAE-Based Synthetic EMG Generation with Mix-Consistency Loss for Recognizing Unseen Motion Combinations

Yazawa, Itsuki, Furui, Akira

arXiv.org Artificial Intelligence

Electromyogram (EMG)-based motion classification using machine learning has been widely employed in applications such as prosthesis control. While previous studies have explored generating synthetic patterns of combined motions to reduce training-data requirements, these methods assume that combined motions can be represented as linear combinations of basic motions. However, this assumption often fails due to complex neuromuscular phenomena such as muscle co-contraction, resulting in low-fidelity synthetic signals and degraded classification performance. To address this limitation, we propose a novel method that learns to synthesize combined motion patterns in a structured latent space. Specifically, we employ a variational autoencoder (VAE) to encode EMG signals into a low-dimensional representation and introduce a mix-consistency loss that structures the latent space such that combined motions are embedded between their constituent basic motions. Synthetic patterns are then generated within this structured latent space and used to train classifiers for recognizing unseen combined motions. We validated our approach through upper-limb motion classification experiments with eight healthy participants. The results demonstrate that our method outperforms input-space synthesis approaches, achieving an approximately 30% improvement in accuracy.
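The key ingredient above is the mix-consistency loss that places a combined motion's latent code between those of its constituent basic motions. A minimal sketch of one plausible form of such a loss follows; the convex-combination target and the function names are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mix_consistency_loss(z_comb, z_a, z_b, alpha=0.5):
    """Hypothetical mix-consistency loss: pull the latent code of a
    combined motion toward a convex combination of its constituent
    basic-motion codes (alpha = 0.5 places it midway between them)."""
    target = alpha * z_a + (1.0 - alpha) * z_b
    return F.mse_loss(z_comb, target)
```

In VAE training, a term of this kind would be added alongside the usual reconstruction and KL-divergence losses.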


PHUMA: Physically-Grounded Humanoid Locomotion Dataset

Lee, Kyungmin, Kim, Sibeen, Park, Minho, Kim, Hyunseung, Hwang, Dongyoon, Lee, Hojoon, Choo, Jaegul

arXiv.org Artificial Intelligence

Figure: Each column illustrates four failure modes: joint violation, floating, penetration, and skating. Humanoid-X (Mao et al., 2025) (top row) often exhibits these issues due to direct video-to-motion conversion, while PHUMA (bottom row) mitigates these violations through careful data curation and physically grounded retargeting.

Motion imitation is a promising approach for humanoid locomotion, enabling agents to acquire humanlike behaviors. Existing methods typically rely on high-quality motion capture datasets such as AMASS, but these are scarce and expensive, limiting scalability and diversity. Recent studies attempt to scale data collection by converting large-scale internet videos, exemplified by Humanoid-X. However, they often introduce physical artifacts such as floating, penetration, and foot skating, which hinder stable imitation. In response, we introduce PHUMA, a Physically-grounded HUMAnoid locomotion dataset that leverages human video at scale while addressing physical artifacts through careful data curation and physics-constrained retargeting. PHUMA enforces joint limits, ensures ground contact, and eliminates foot skating, producing motions that are both large-scale and physically reliable. We evaluated PHUMA under two sets of conditions: (i) imitation of unseen motion from self-recorded test videos and (ii) path following with pelvis-only guidance. In both cases, PHUMA-trained policies outperform Humanoid-X and AMASS, achieving significant gains in imitating diverse motions. The code is available at https://davian-robotics.github.io/PHUMA. Humanoid robots are central to the pursuit of general-purpose embodied AI, but their deployment in the real world first requires locomotion that is both stable and humanlike.
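To make the four failure modes concrete, here is a hedged sketch of a curation-style filter that flags joint violations, floating, penetration, and foot skating in a motion clip; all thresholds and array conventions are assumptions, not PHUMA's actual pipeline.

```python
import numpy as np

def physically_valid(q, foot_pos, q_min, q_max,
                     float_tol=0.05, pen_tol=0.01, skate_tol=0.02):
    """Flag the four artifacts named above (illustrative thresholds).

    q:        (T, J) joint angles; q_min/q_max: (J,) joint limits.
    foot_pos: (T, F, 3) foot positions, z = height above the ground plane.
    """
    if np.any(q < q_min) or np.any(q > q_max):   # joint violation
        return False
    z = foot_pos[..., 2]
    if np.any(z.min(axis=1) > float_tol):        # floating: all feet airborne
        return False
    if np.any(z < -pen_tol):                     # penetration: foot below ground
        return False
    contact = z[:-1] < float_tol                 # feet in contact, per frame
    slide = np.linalg.norm(np.diff(foot_pos[..., :2], axis=0), axis=-1)
    if np.any(slide[contact] > skate_tol):       # skating: contact foot slides
        return False
    return True
```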




Understanding Cognitive States from Head & Hand Motion Data

Wen, Kaiang, Miller, Mark Roman

arXiv.org Artificial Intelligence

Figure: The pipeline illustrates the full workflow from data collection in VR, through self-annotation and human baseline evaluation, to modeling and analysis of cognitive states.

As virtual reality (VR) and augmented reality (AR) continue to gain popularity, head and hand motion data captured by consumer VR systems have become ubiquitous. Prior work shows that such telemetry can be highly identifying and can reflect broad user traits, often aligning with intuitive "folk theories" of body language. However, it remains unclear to what extent motion kinematics encode more nuanced cognitive states, such as confusion, hesitation, and readiness, which lack clear correlates with motion. To investigate this, we introduce a novel dataset of head and hand motion with frame-level annotations of these states, collected during structured decision-making tasks. Our findings suggest that deep temporal models can infer subtle cognitive states from motion alone, achieving performance comparable to that of human observers. This work demonstrates that standard VR telemetry contains strong patterns related to users' internal cognitive processes, which opens the door for a new generation of applications. To enhance reproducibility and support future work, we will make our dataset and modeling framework publicly available.

Virtual Reality (VR) is rapidly evolving from a specialized tool for simulation and entertainment into a mainstream computing platform for work, education, and social interaction. As users spend more time in these immersive environments, the quality of human-computer interaction becomes paramount. The next generation of VR systems must move beyond explicit, command-based interfaces and develop the capacity for implicit, nuanced understanding. This requires an ability to perceive and adapt to a user's cognitive state in real time, creating experiences that are more intuitive, supportive, and effective. The key to unlocking this capability lies in decoding the rich, continuous, and often subconscious stream of motion data generated by every user.
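As a concrete reading of the "deep temporal models" mentioned above, here is a minimal PyTorch sketch of a recurrent classifier that maps per-frame head-and-hand telemetry to frame-level cognitive-state logits; the feature count, number of states, and architecture are assumptions rather than the paper's model.

```python
import torch
import torch.nn as nn

class CognitiveStateClassifier(nn.Module):
    """Illustrative frame-level classifier for VR telemetry. Assumes
    18 input features (e.g., 3D position + 3D rotation for the headset
    and two hand controllers) and 3 states (confusion, hesitation,
    readiness), following the abstract's examples."""
    def __init__(self, n_features=18, n_states=3, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_states)

    def forward(self, x):
        # x: (batch, frames, n_features) -> (batch, frames, n_states) logits
        h, _ = self.rnn(x)
        return self.head(h)
```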


A CARLA-based Simulation of Electrically Driven Forklifts

Claus, David, Thielemann, Christiane, Stark, Hans-Georg

arXiv.org Artificial Intelligence

This paper presents the simulation of the operation of an electric forklift fleet within an intralogistics scenario. For this purpose, the open-source simulation tool CARLA is used; to our knowledge, this is a novel approach in the context of logistics simulation. First, CARLA is used to generate and visualize a realistic 3D outdoor warehouse scenario, incorporating a number of randomly moving forklifts. Next, intralogistics transport tasks, such as pick-and-place, are simulated for the forklift fleet, including shortest-path finding. Furthermore, the capability to play back localization data previously recorded from a "real" forklift fleet is demonstrated. This playback is done in the recreated original environment, thereby enabling the visualization of the forklifts' movements. Finally, the energy consumption of the forklift trucks is simulated by integrating a physical battery model that generates the state of charge (SOC) of each truck as a function of load and activity. To demonstrate the wide range of possible applications for the CARLA simulation platform, we describe two use cases. The first deals with the problem of detecting regions with critically high traffic densities, the second with the optimal placement of charging stations for the forklift trucks. Both use cases are calculated for an exemplary warehouse model.
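To illustrate the kind of battery model the paper describes (SOC as a function of load and activity), here is a hedged sketch; the activity categories, power-draw values, and battery capacity are assumptions, not the paper's parameters.

```python
# Hypothetical average power draw per activity, in kW.
POWER_DRAW_KW = {"idle": 0.5, "driving": 4.0, "lifting_loaded": 8.0}

def update_soc(soc, activity, dt_s, capacity_kwh=20.0):
    """Advance the state of charge by one timestep: subtract the energy
    drawn during dt_s seconds from the (assumed) battery capacity."""
    energy_kwh = POWER_DRAW_KW[activity] * dt_s / 3600.0
    return max(0.0, soc - energy_kwh / capacity_kwh)

# Example: one minute of loaded lifting from a full battery.
soc = update_soc(1.0, "lifting_loaded", 60.0)   # -> ~0.993
```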