Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers

Neural Information Processing Systems

One of the roadblocks to training generalist robotic models today is heterogeneity. Previous robot learning methods often collect data to train on one specific embodiment for one task, which is expensive and prone to overfitting. This work studies the problem of learning policy representations through heterogeneous pre-training on robot data across different embodiments and tasks at scale. We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a large, shareable trunk of a policy neural network to learn a task- and embodiment-agnostic shared representation. This general architecture aligns the specific proprioception and vision inputs from distinct embodiments to a short sequence of tokens and then processes these tokens to produce control outputs for different tasks. Leveraging recent large-scale multi-embodiment real-world robotic datasets as well as simulation, deployed robots, and human video datasets, we investigate pre-training policies across heterogeneity. We conduct experiments to investigate the scaling behaviors of training objectives, with up to 52 datasets. HPT outperforms several baselines and enhances fine-tuned policy performance by over 20% on unseen tasks in multiple simulator benchmarks and real-world settings. See the project website (liruiw.github.io/hpt).
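A minimal numpy sketch of the stem/trunk/head idea described above: embodiment-specific "stems" map heterogeneous proprioception inputs to a short, fixed-length token sequence that a single shared trunk processes. The token width, token count, linear stand-in for the transformer trunk, and mean-pooled head are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64          # shared token width consumed by the trunk (assumed)
N_TOKENS = 16   # short fixed-length token sequence per input (assumed)

def make_stem(in_dim):
    """Embodiment-specific stem: projects raw inputs to N_TOKENS x D tokens."""
    W = rng.standard_normal((in_dim, N_TOKENS * D)) * 0.02
    return lambda x: (x @ W).reshape(N_TOKENS, D)

# Two embodiments with different proprioception sizes share one trunk.
stem_arm7 = make_stem(in_dim=7)    # e.g. a 7-DoF arm joint state
stem_arm14 = make_stem(in_dim=14)  # e.g. a bimanual 14-DoF state

W_trunk = rng.standard_normal((D, D)) * 0.02  # stand-in for the shared transformer trunk

def trunk(tokens):
    """Shared, embodiment-agnostic processing of the aligned token sequence."""
    return np.tanh(tokens @ W_trunk)

def head(features, action_dim):
    """Embodiment-specific head: pools trunk features and maps them to actions."""
    W = rng.standard_normal((D, action_dim)) * 0.02
    return features.mean(axis=0) @ W

z1 = trunk(stem_arm7(rng.standard_normal(7)))
z2 = trunk(stem_arm14(rng.standard_normal(14)))
print(z1.shape, z2.shape)  # both (16, 64): heterogeneous inputs, one shared representation
a = head(z1, action_dim=7)
```

The point of the sketch is the shape discipline: whatever the embodiment, the stem emits the same `(N_TOKENS, D)` tensor, so the trunk can be pre-trained once across all datasets.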


Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models

Neural Information Processing Systems

Unlike most reinforcement learning agents, which require an unrealistic amount of environment interaction to learn a new behaviour, humans excel at learning quickly by merely observing and imitating others. This ability depends heavily on the fact that humans have a model of their own embodiment that allows them to infer the most likely actions that led to the observed behaviour. In this paper, we propose Action Inference by Maximising Evidence (AIME) to replicate this behaviour using world models. AIME consists of two distinct phases. In the first phase, the agent learns a world model from its past experience to understand its own body by maximising the ELBO.
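The two-phase idea can be illustrated with a toy example: given a world model learned in phase one (here simply assumed as known one-step dynamics), phase two infers the actions that make an observed, state-only demonstration most likely by gradient ascent on the evidence. The dynamics, noise scale, and learning rate below are illustrative assumptions, not AIME's actual model.

```python
import numpy as np

# Toy 'world model': x_{t+1} = x_t + a_t (assumed known here; learned in phase 1 in AIME).
def predict(x, a):
    return x + a

def log_evidence(actions, observations, sigma=0.1):
    """Gaussian log-likelihood of the observed transitions under the world model."""
    ll = 0.0
    for t in range(len(actions)):
        pred = predict(observations[t], actions[t])
        ll += -0.5 * np.sum((observations[t + 1] - pred) ** 2) / sigma**2
    return ll

# Observed state-only demonstration: move from 0 to 3 in unit steps, no actions given.
obs = np.array([[0.0], [1.0], [2.0], [3.0]])

# Phase 2: infer the actions by gradient ascent on the evidence (closed-form gradient).
actions = np.zeros((3, 1))
for _ in range(200):
    grad = np.array([(obs[t + 1] - predict(obs[t], actions[t])) / 0.1**2 for t in range(3)])
    actions += 1e-3 * grad

print(np.round(actions.ravel(), 2))  # → close to [1. 1. 1.]
```

The inferred actions recover the unit steps that explain the observations, which is the essence of zero-shot imitation from observation: no action labels and no environment interaction in phase two.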


PEAC: Unsupervised Pre-training for Cross-Embodiment Reinforcement Learning

Neural Information Processing Systems

Designing generalizable agents capable of adapting to diverse embodiments has attracted significant attention in Reinforcement Learning (RL), which is critical for deploying RL agents in various real-world applications. Previous Cross-Embodiment RL approaches have focused on transferring knowledge across embodiments within specific tasks. These methods often result in knowledge tightly coupled with those tasks and fail to adequately capture the distinct characteristics of different embodiments. To address this limitation, we introduce the notion of Cross-Embodiment Unsupervised RL (CEURL), which leverages unsupervised learning to enable agents to acquire embodiment-aware and task-agnostic knowledge through online interactions within reward-free environments. We formulate CEURL as a novel Controlled Embodiment Markov Decision Process (CE-MDP) and systematically analyze CEURL's pre-training objectives under CE-MDP. Based on these analyses, we develop a novel algorithm, Pre-trained Embodiment-Aware Control (PEAC), for handling CEURL, incorporating an intrinsic reward function specifically designed for cross-embodiment pre-training. PEAC not only provides an intuitive optimization strategy for cross-embodiment pre-training but also integrates flexibly with existing unsupervised RL methods, facilitating cross-embodiment exploration and skill discovery. Extensive experiments in both simulated (e.g., DMC and Robosuite) and real-world environments (e.g., legged locomotion) show that PEAC significantly improves adaptation performance and cross-embodiment generalization, demonstrating its effectiveness in overcoming the unique challenges of CEURL.


UMI-on-Air: Embodiment-Aware Guidance for Embodiment-Agnostic Visuomotor Policies

Gupta, Harsh, Guo, Xiaofeng, Ha, Huy, Pan, Chuer, Cao, Muqing, Lee, Dongjae, Scherer, Sebastian, Song, Shuran, Shi, Guanya

arXiv.org Artificial Intelligence

We introduce UMI-on-Air, a framework for embodiment-aware deployment of embodiment-agnostic manipulation policies. Our approach leverages diverse, unconstrained human demonstrations collected with a handheld gripper (UMI) to train generalizable visuomotor policies. A central challenge in transferring these policies to constrained robotic embodiments, such as aerial manipulators, is the mismatch in control and robot dynamics, which often leads to out-of-distribution behaviors and poor execution. To address this, we propose Embodiment-Aware Diffusion Policy (EADP), which couples a high-level UMI policy with a low-level embodiment-specific controller at inference time. By integrating gradient feedback from the controller's tracking cost into the diffusion sampling process, our method steers trajectory generation towards dynamically feasible modes tailored to the deployment embodiment. This enables plug-and-play, embodiment-aware trajectory adaptation at test time. We validate our approach on multiple long-horizon and high-precision aerial manipulation tasks, showing improved success rates, efficiency, and robustness under disturbances compared to unguided diffusion baselines. Finally, we demonstrate deployment in previously unseen environments, using UMI demonstrations collected in the wild, highlighting a practical pathway for scaling generalizable manipulation skills across diverse, and even highly constrained, embodiments. All code, data, and checkpoints will be publicly released after acceptance. Result videos can be found at umi-on-air.github.io.
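The guidance mechanism the abstract describes, descending the gradient of a controller's tracking cost inside the diffusion sampling loop, can be sketched as follows. The denoiser stand-in, the step-length tracking cost, and all scales are illustrative assumptions; the real EADP uses a learned diffusion policy and an embodiment-specific controller.

```python
import numpy as np

rng = np.random.default_rng(0)

def tracking_cost(traj, max_step=0.2):
    """Hypothetical low-level controller cost: penalize steps the embodiment cannot track."""
    steps = np.diff(traj, axis=0)
    excess = np.maximum(np.linalg.norm(steps, axis=1) - max_step, 0.0)
    return float(np.sum(excess**2))

def cost_grad(traj, eps=1e-4):
    """Central-difference gradient of the tracking cost w.r.t. every waypoint coordinate."""
    g = np.zeros_like(traj)
    for i in np.ndindex(traj.shape):
        d = np.zeros_like(traj)
        d[i] = eps
        g[i] = (tracking_cost(traj + d) - tracking_cost(traj - d)) / (2 * eps)
    return g

def denoise_step(x, t):
    """Stand-in for the learned UMI diffusion policy's denoiser (assumed)."""
    target = np.linspace(0, 1, len(x))[:, None] * np.ones((1, 2))  # toy mean trajectory
    return x + 0.1 * (target - x)

# Guided reverse diffusion: after each denoise step, descend the controller's cost.
x = rng.standard_normal((16, 2))  # noisy 16-waypoint 2-D trajectory
guidance_scale = 0.1
for t in range(50):
    x = denoise_step(x, t)
    x = x - guidance_scale * cost_grad(x)

print(round(tracking_cost(x), 4))  # → 0.0: all remaining steps are within the trackable limit
```

Because the guidance term only activates where the cost is nonzero, it nudges sampling toward dynamically feasible modes without retraining the policy, which is what makes the scheme plug-and-play across embodiments.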


SwarmDiffusion: End-To-End Traversability-Guided Diffusion for Embodiment-Agnostic Navigation of Heterogeneous Robots

Zhura, Iana, Karaf, Sausar, Batool, Faryal, Mudalige, Nipun Dhananjaya Weerakkodi, Serpiva, Valerii, Abdulkarim, Ali Alridha, Fedoseev, Aleksey, Seyidov, Didar, Amjad, Hajira, Tsetserukou, Dzmitry

arXiv.org Artificial Intelligence

Visual traversability estimation is critical for autonomous navigation, but existing VLM-based methods rely on hand-crafted prompts, generalize poorly across embodiments, and output only traversability maps, leaving trajectory generation to slow external planners. We propose SwarmDiffusion, a lightweight end-to-end diffusion model that jointly predicts traversability and generates a feasible trajectory from a single RGB image. To remove the need for annotated or planner-produced paths, we introduce a planner-free trajectory construction pipeline based on randomized waypoint sampling, Bézier smoothing, and regularization enforcing connectivity, safety, directionality, and path thinness. This enables learning stable motion priors without demonstrations. SwarmDiffusion leverages VLM-derived supervision without prompt engineering and conditions the diffusion process on a compact embodiment state, producing physically consistent, traversable paths that transfer across different robot platforms. Across indoor environments and two embodiments (quadruped and aerial), the method achieves 80-100% navigation success and 0.09 s inference, and adapts to a new robot using only 500 additional visual samples. Reliable indoor navigation is fundamental to a wide range of robotic applications, including warehouse automation [1], industrial inspection [2], search and rescue, and autonomous logistics. In these settings, robots must continuously reason about where they can safely move and how to plan a feasible trajectory through cluttered, unstructured, and dynamic spaces.
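The planner-free construction pipeline above (randomized waypoint sampling, Bézier smoothing, a traversability check standing in for the safety/connectivity regularizers) can be sketched in a few lines. The obstacle map, waypoint count, and retry budget are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def bezier(points, n=50):
    """Evaluate a Bézier curve over control points via de Casteljau's algorithm."""
    out = []
    for t in np.linspace(0, 1, n):
        p = points.astype(float)
        while len(p) > 1:
            p = (1 - t) * p[:-1] + t * p[1:]
        out.append(p[0])
    return np.array(out)

def sample_trajectory(start, goal, traversable, n_way=3, tries=100):
    """Randomized waypoint sampling + Bézier smoothing, kept only if fully traversable."""
    for _ in range(tries):
        ways = rng.uniform(0, 1, size=(n_way, 2))
        ctrl = np.vstack([start, ways, goal])
        traj = bezier(ctrl)
        if all(traversable(p) for p in traj):  # stands in for the safety regularizer
            return traj
    return None

# Toy traversability map: free space everywhere except a disk obstacle.
def traversable(p):
    return np.linalg.norm(p - np.array([0.7, 0.3])) > 0.15

traj = sample_trajectory(np.array([0.0, 0.0]), np.array([1.0, 1.0]), traversable)
print(traj is not None)
```

Curves generated this way start and end exactly at the endpoints (a Bézier curve interpolates its first and last control points), so the pipeline yields smooth, connected, obstacle-free paths without ever invoking a planner, which is what lets the diffusion model learn motion priors from them.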


FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization

Liu, Yicheng, Zhang, Shiduo, Dong, Zibin, Ye, Baijun, Yuan, Tianyuan, Yu, Xiaopeng, Yin, Linqi, Lu, Chenhao, Shi, Junhao, Yu, Luca Jiang-Tao, Zheng, Liangtao, Jiang, Tao, Gong, Jingjing, Qiu, Xipeng, Zhao, Hang

arXiv.org Artificial Intelligence

Figure 1 (caption): FASTer combines a learnable action tokenizer (FASTerVQ) and an autoregressive VLA model (FASTerVLA), achieving efficient compression, fast control, and strong performance across eight real and simulated embodiments. Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer (FASTerVQ) with an autoregressive policy built upon it. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance. Vision-Language-Action (VLA) models represent a paradigm shift in robotics, embodying generalist robot policies trained on increasingly large-scale robotic datasets (Chenjia Bai, 2024). These models are categorized primarily by their method of robot action prediction, with the most prominent approaches being diffusion-based (Team et al., 2024; Black et al., 2024) and autoregressive VLA (Belkhale & Sadigh, 2024; Kim et al., 2024; Pertsch et al., 2025; Zhou et al., 2025) models. While diffusion-based models have demonstrated superior precision in manipulation tasks, they often exhibit a notable deficiency in leveraging critical visual and linguistic cues (Pertsch et al., 2025; Dong et al., 2025).
In contrast, recent research indicates that a carefully designed autoregressive VLA model can increasingly bridge the performance gap with its diffusion-based counterparts, while simultaneously offering enhanced instruction-following capabilities (Pertsch et al., 2025; Intelligence et al., 2025; Hancock et al., 2025), superior scene generalization (Pertsch et al., 2025), and effective transfer of common-sense knowledge (Brohan et al., 2023). Most importantly, autoregressive VLA models share the most architectural similarity with the highly successful Vision-Language Models (VLMs), suggesting significant potential for future advancements. A pivotal challenge within autoregressive VLA models is the development of an appropriate tokenization scheme to discretize continuous robot action sequences into action tokens (Wang et al., 2025c; Pertsch et al., 2025). Numerous sequence modeling studies, including LLMs and Speech-LLMs, have demonstrated that tokenizer quality directly determines model performance (Radford et al., 2019; Zhang et al., 2023; Gong et al., 2025).
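The tokenization problem discussed above, discretizing continuous action sequences into tokens an autoregressive model can predict, can be illustrated with a minimal vector-quantization sketch. The codebook size, chunk length, and nearest-neighbor scheme are generic VQ assumptions, not the FASTerVQ design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic VQ-style action tokenizer sketch: 256 discrete tokens, chunks of 4 action dims.
codebook = rng.uniform(-1, 1, size=(256, 4))

def tokenize(actions):
    """Map continuous action chunks to the indices of their nearest codebook entries."""
    chunks = actions.reshape(-1, 4)
    dists = np.linalg.norm(chunks[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def detokenize(tokens):
    """Reconstruct a continuous action sequence from token indices."""
    return codebook[tokens].reshape(-1)

a = rng.uniform(-1, 1, size=8)  # a short continuous action sequence (two chunks)
toks = tokenize(a)
recon = detokenize(toks)
print(len(toks), recon.shape)   # 2 tokens encode 8 action values
```

The fidelity/efficiency trade-off the abstract names is visible here: a denser codebook or shorter chunks reconstruct actions more faithfully but yield longer token sequences, which is exactly what a learned tokenizer tries to balance.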


VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

Shen, Yichao, Wei, Fangyun, Du, Zhiying, Liang, Yaobo, Lu, Yan, Yang, Jiaolong, Zheng, Nanning, Guo, Baining

arXiv.org Artificial Intelligence

Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy, forecasting both actions and their visual consequences, points toward a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.


HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies

Du, Zhiying, Liu, Bei, Liang, Yaobo, Shen, Yichao, Cao, Haidong, Zheng, Xiangyu, Feng, Zhiyuan, Wu, Zuxuan, Yang, Jiaolong, Jiang, Yu-Gang

arXiv.org Artificial Intelligence

The development of foundation models for embodied intelligence critically depends on access to large-scale, high-quality robot demonstration data. Recent approaches have sought to address this challenge by training on large collections of heterogeneous robotic datasets. However, unlike vision or language data, robotic demonstrations exhibit substantial heterogeneity across embodiments and action spaces, as well as other prominent variations such as sensor configurations and action control frequencies. The lack of explicit designs for handling such heterogeneity causes existing methods to struggle with integrating diverse factors, thereby limiting their generalization and leading to degraded performance when transferred to new settings. In this paper, we present HiMoE-VLA, a novel vision-language-action (VLA) framework tailored to effectively handle diverse robotic data with heterogeneity. Specifically, we introduce a Hierarchical Mixture-of-Experts (HiMoE) architecture for the action module which adaptively handles multiple sources of heterogeneity across layers and gradually abstracts them into shared knowledge representations. Through extensive experimentation with simulation benchmarks and real-world robotic platforms, HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and robust generalization across diverse robots and action spaces. The code and models are publicly available at https://github.com/ZhiyingDu/HiMoE-VLA.
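The hierarchical mixture-of-experts idea, many experts at lower layers to absorb embodiment- and action-space-level heterogeneity and fewer, more shared experts higher up, can be sketched with dense soft routing. The expert counts, gating scheme, and linear experts below are generic MoE assumptions, not the HiMoE design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def make_moe_layer(n_experts, d):
    """One MoE layer: a learned gate softly routes the input across expert networks."""
    gate = rng.standard_normal((d, n_experts)) * 0.1
    experts = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]
    def forward(x):
        w = softmax(x @ gate)  # routing weights over experts, one set per input
        return sum(wi * np.tanh(x @ E) for wi, E in zip(w, experts))
    return forward

d = 32
# Hierarchy: a wide lower layer for fine-grained heterogeneity, a narrow upper layer
# that abstracts the routed features into a more shared representation.
layers = [make_moe_layer(8, d), make_moe_layer(2, d)]

x = rng.standard_normal(d)  # feature vector for one robot / action-space combination
for layer in layers:
    x = layer(x)
print(x.shape)  # (32,): same width throughout, so layers stack freely
```

The structural point is that routing happens per layer: inputs from different embodiments can take different expert paths low in the stack while converging to shared experts near the top.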