U-DiT Policy: U-shaped Diffusion Transformers for Robotic Manipulation

Wu, Linzhi; Mei, Aoran; Wang, Xiyue; Zhu, Guo-Niu; Gan, Zhongxue

Sep-30-2025, arXiv.org Artificial Intelligence

Abstract: Diffusion-based methods have been acknowledged as a powerful paradigm for end-to-end visuomotor control in robotics. Most existing approaches adopt a Diffusion Policy in U-Net architecture (DP-U), which, while effective, suffers from limited global context modeling and over-smoothing artifacts. To address these issues, we propose U-DiT Policy, a novel U-shaped Diffusion Transformer framework. U-DiT preserves the multi-scale feature fusion advantages of U-Net while integrating the global context modeling capability of Transformers, thereby enhancing representational power and policy expressiveness. We evaluate U-DiT extensively across both simulation and real-world robotic manipulation tasks. In simulation, U-DiT achieves an average performance gain of 10% over baseline methods and surpasses Transformer-based diffusion policies (DP-T) that use AdaLN blocks by 6% under comparable parameter budgets. On real-world robotic tasks, U-DiT demonstrates superior generalization and robustness, achieving an average improvement of 22.5% over DP-U.

Imitation learning [1] has emerged as a prominent data-driven and sample-efficient approach for programming robots from expert demonstrations. Within this paradigm, behavior cloning is typically formulated as a supervised regression task that maps observations to corresponding actions.
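To make the regression view of behavior cloning concrete, the following is a minimal sketch and not the method proposed in this paper. It assumes a linear policy fitted by least squares on synthetic demonstrations; the function names `fit_linear_bc_policy` and `act`, the dimensions, and the data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_linear_bc_policy(observations, actions):
    """Fit a linear policy a = [o, 1] @ W by least squares (behavior cloning)."""
    # Append a constant 1 to every observation so the policy can learn a bias.
    ones = np.ones((observations.shape[0], 1))
    features = np.hstack([observations, ones])
    # Minimize the squared error between predicted and expert actions.
    weights, *_ = np.linalg.lstsq(features, actions, rcond=None)
    return weights


def act(weights, observation):
    """Predict an action for a single observation vector."""
    return np.append(observation, 1.0) @ weights


# Synthetic "expert" demonstrations (hypothetical data, illustration only).
rng = np.random.default_rng(0)
expert_matrix = rng.normal(size=(4, 2))  # 4-D observations -> 2-D actions
demo_obs = rng.normal(size=(200, 4))
demo_actions = demo_obs @ expert_matrix + 0.5

policy = fit_linear_bc_policy(demo_obs, demo_actions)
```

Calling `act(policy, demo_obs[0])` then returns the cloned action for the first demonstration. Diffusion-based policies such as those discussed in the abstract replace this single-point regression with a learned denoising process over actions, which this sketch does not attempt to show.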
