StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

Liu, Mingyu, Shu, Jiuhe, Chen, Hui, Li, Zeju, Zhao, Canyu, Yang, Jiange, Gao, Shenyuan, Chen, Hao, Shen, Chunhua

arXiv.org Artificial Intelligence 

A fundamental challenge in embodied intelligence is developing expressive and compact state representations for efficient world modeling and decision making. However, existing methods often fail to balance compactness and expressivity, yielding representations that are either overly redundant or lacking in task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on the decoder's strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 14.3% on LIBERO and 30% in real-world task success rate with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally represents the motion, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures dynamics without explicit motion supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation, which is encoded from static images, challenging the prevalent dependence on complex temporal modeling and video data for learning robotic motions. Our learned representations also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability.

"What we observe as static is merely dynamic equilibrium." -- Richard Feynman, The Feynman Lectures on Physics

Learning reusable and generalizable representations is a cornerstone of intelligent robotics systems.
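
The abstract's idea of reading motion off the difference between two compact state representations can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the token dimension, the hand-written token values standing in for encoder outputs, the helper names `latent_motion` and `interpolate`, and the choice of plain linear interpolation.

```python
from typing import List

Token = List[float]
State = List[Token]  # two tokens per image in this sketch


def latent_motion(z_start: State, z_goal: State) -> State:
    """Token-wise difference between two state representations."""
    return [[g - s for s, g in zip(ts, tg)] for ts, tg in zip(z_start, z_goal)]


def interpolate(z_start: State, z_goal: State, alpha: float) -> State:
    """Linear interpolation: z_start + alpha * (z_goal - z_start)."""
    delta = latent_motion(z_start, z_goal)
    return [[s + alpha * d for s, d in zip(ts, td)] for ts, td in zip(z_start, delta)]


# Hypothetical stand-ins for encoder outputs (2 tokens, 4 dimensions each).
z_t = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]]
z_next = [[1.0, 1.0, 0.0, 3.0], [1.0, 2.0, 3.0, 4.0]]

motion = latent_motion(z_t, z_next)       # candidate latent "motion" vector
midpoint = interpolate(z_t, z_next, 0.5)  # intermediate latent state
```

In this toy setting, `motion` would be the quantity passed to an action decoder, and `midpoint` would be the kind of intermediate latent that a generative decoder could render as an in-between frame. How the actual method decodes either one is not specified in the abstract.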