U-DiT Policy: U-shaped Diffusion Transformers for Robotic Manipulation
Wu, Linzhi, Mei, Aoran, Wang, Xiyue, Zhu, Guo-Niu, Gan, Zhongxue
arXiv.org Artificial Intelligence
Abstract-- Diffusion-based methods have been acknowledged as a powerful paradigm for end-to-end visuomotor control in robotics. Most existing approaches adopt a Diffusion Policy with a U-Net architecture (DP-U), which, while effective, suffers from limited global context modeling and over-smoothing artifacts. To address these issues, we propose U-DiT Policy, a novel U-shaped Diffusion Transformer framework. U-DiT preserves the multi-scale feature fusion advantages of U-Net while integrating the global context modeling capability of Transformers, thereby enhancing representational power and policy expressiveness. We evaluate U-DiT extensively on both simulated and real-world robotic manipulation tasks. In simulation, U-DiT achieves an average performance gain of 10% over baseline methods and surpasses Transformer-based diffusion policies (DP-T) that use AdaLN blocks by 6% under comparable parameter budgets. On real-world robotic tasks, U-DiT demonstrates superior generalization and robustness, achieving an average improvement of 22.5% over DP-U.

Imitation learning [1] has emerged as a prominent data-driven and sample-efficient approach for programming robots from expert demonstrations. Within this paradigm, behavior cloning is typically formulated as a supervised regression task that maps observations to corresponding actions.
Sep-30-2025
- Country:
- Asia > China
- North America > Mexico
- Gulf of Mexico (0.04)
- Genre:
- Research Report > New Finding (0.46)
- Technology: