RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Liu, Songming, Wu, Lingxuan, Li, Bangguo, Tan, Hengkai, Chen, Huayu, Wang, Zhengyi, Xu, Ke, Su, Hang, Zhu, Jun

arXiv.org Artificial Intelligence 

Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which unifies the action representations of various robots while preserving the physical meanings of the original actions, facilitating the learning of transferable physical knowledge. With these designs, we pre-trained RDT on the largest collection of multi-robot datasets to date and scaled it up to 1.2B parameters, making it the largest diffusion-based foundation model for robotic manipulation. Finally, we fine-tuned RDT on a self-created multi-task bimanual dataset with over 6K episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1~5 demonstrations, and effectively handles complex, dexterous tasks. We refer to the project page for the code and videos.

Bimanual manipulation is essential for robots to accomplish real-world tasks (Edsinger & Kemp, 2007). For practical applications, a useful manipulation policy should be able to generalize to unseen scenarios, such as unseen objects and scenes. Following the success in natural language processing (Achiam et al., 2023; Touvron et al., 2023) and computer vision (Radford et al., 2021; Kirillov et al., 2023), one promising direction toward generalizable behavior is to develop a foundation model through imitation learning on large-scale datasets. However, developing a bimanual manipulation foundation model is highly non-trivial. One main reason is that the accessible data for a specific dual-arm robot is extremely scarce (Sharma et al., 2018; Collaboration et al., 2023) due to high hardware costs, which falls short of the data-intensive requirements of training foundation models. Inspired by recent attempts in unimanual manipulation (Brohan et al., 2023; Kim et al., 2024), we seek to first pre-train on extensive multi-robot datasets and then fine-tune on a small dataset collected on the target dual-arm robot. This allows us to scale up the data size by up to three orders of magnitude and offers the potential to learn transferable physical knowledge from the datasets of other robots. Nevertheless, there are two key technical challenges.
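To make the idea of a unified action space concrete, the following is a minimal sketch of how actions from heterogeneous robots could be mapped into one fixed-width vector with physically meaningful slots. The slot names, dimensions, and masking scheme here are illustrative assumptions, not the paper's actual specification.

```python
import numpy as np

# Illustrative slot layout for a unified action vector. Each slot keeps a fixed
# physical meaning; names and dimensions are assumptions for this sketch only.
UNIFIED_SLOTS = {
    "right_eef_pos": slice(0, 3),    # right end-effector position (x, y, z)
    "right_eef_rot": slice(3, 9),    # right end-effector rotation (6D representation)
    "right_gripper": slice(9, 10),   # right gripper opening (normalized)
    "left_eef_pos":  slice(10, 13),  # left end-effector position (x, y, z)
    "left_eef_rot":  slice(13, 19),  # left end-effector rotation (6D representation)
    "left_gripper":  slice(19, 20),  # left gripper opening (normalized)
}
UNIFIED_DIM = 20


def to_unified(native_action: dict) -> tuple[np.ndarray, np.ndarray]:
    """Scatter a robot's native action fields into the unified vector.

    Returns the padded action and a boolean mask marking which slots this
    robot actually provides, so a model can ignore the unused dimensions.
    """
    action = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    for name, value in native_action.items():
        sl = UNIFIED_SLOTS[name]
        action[sl] = value
        mask[sl] = True
    return action, mask


# A single-arm robot fills only the right-arm slots; a dual-arm robot fills both,
# so data from both kinds of robots can share one action representation.
single_arm_action = {
    "right_eef_pos": [0.42, -0.10, 0.31],
    "right_eef_rot": [1, 0, 0, 0, 1, 0],
    "right_gripper": [0.8],
}
unified_action, valid_mask = to_unified(single_arm_action)
print(unified_action.shape, valid_mask.sum())  # (20,) 10
```

Because every slot retains its physical meaning across robots, a model trained on this shared representation can, in principle, transfer knowledge such as end-effector motion patterns from single-arm datasets to the target dual-arm robot.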