U-DiT
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers
Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance and good scalability; meanwhile, the abandonment of the U-Net by DiTs and their successors is worth rethinking. To this end, we conduct a simple toy experiment comparing a U-Net-style DiT with an isotropic one. It turns out that the U-Net architecture gains only a slight advantage from the U-Net inductive bias, indicating potential redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net backbone features are dominated by low frequencies, we perform token downsampling on the query-key-value tuple for self-attention, which brings further improvements despite a considerable reduction in computation. Based on self-attention with downsampled tokens, we propose a series of U-shaped DiTs (U-DiTs) in this paper and conduct extensive experiments to demonstrate the extraordinary performance of U-DiT models. The proposed U-DiT can outperform DiT-XL with only 1/6 of its computation cost.
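To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of self-attention over downsampled tokens. The 2x2 average-pool downsampling, nearest-neighbor upsampling, and the `DownsampledSelfAttention` name are illustrative assumptions for this listing, not the paper's exact operators; the point is that attending over a 4x-smaller token set cuts the quadratic attention cost while the low-frequency content survives the pooling.

```python
# Illustrative sketch (not the paper's exact implementation): self-attention
# computed on spatially downsampled tokens, then upsampled for the residual.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampledSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # dim must be divisible by num_heads.
        self.qkv = nn.Linear(dim, 3 * dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) latent feature map; H and W assumed even.
        B, C, H, W = x.shape
        # Downsample 2x along each spatial axis before attention,
        # reducing the token count (and attention cost) by 4x.
        x_down = F.avg_pool2d(x, kernel_size=2)      # (B, C, H/2, W/2)
        tokens = x_down.flatten(2).transpose(1, 2)   # (B, HW/4, C)
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        out, _ = self.attn(q, k, v, need_weights=False)
        out = self.proj(out)
        # Restore the original token resolution for the residual path.
        out = out.transpose(1, 2).reshape(B, C, H // 2, W // 2)
        out = F.interpolate(out, size=(H, W), mode="nearest")
        return x + out
```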
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
U-DiT Policy: U-shaped Diffusion Transformers for Robotic Manipulation
Wu, Linzhi; Mei, Aoran; Wang, Xiyue; Zhu, Guo-Niu; Gan, Zhongxue
Abstract-- Diffusion-based methods have been acknowledged as a powerful paradigm for end-to-end visuomotor control in robotics. Most existing approaches adopt a Diffusion Policy with a U-Net architecture (DP-U), which, while effective, suffers from limited global context modeling and over-smoothing artifacts. To address these issues, we propose U-DiT Policy, a novel U-shaped Diffusion Transformer framework. U-DiT preserves the multi-scale feature fusion advantages of U-Net while integrating the global context modeling capability of Transformers, thereby enhancing representational power and policy expressiveness. We evaluate U-DiT extensively across both simulated and real-world robotic manipulation tasks. In simulation, U-DiT achieves an average performance gain of 10% over baseline methods and surpasses Transformer-based diffusion policies (DP-T) that use AdaLN blocks by 6% under comparable parameter budgets. On real-world robotic tasks, U-DiT demonstrates superior generalization and robustness, achieving an average improvement of 22.5% over DP-U.

Imitation learning [1] has emerged as a prominent data-driven and sample-efficient approach for programming robots from expert demonstrations. Within this paradigm, behavior cloning is typically formulated as a supervised regression task that maps observations to corresponding actions.
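For readers unfamiliar with the AdaLN conditioning mentioned for the DP-T baseline, here is a minimal sketch of a DiT-style AdaLN transformer block, where per-block scale, shift, and gate parameters are regressed from a conditioning vector (e.g. the diffusion timestep plus an observation embedding). The `AdaLNBlock` name and the 4x MLP expansion are assumptions for illustration; the actual blocks used in the paper may differ (e.g. DiT's adaLN-Zero additionally zero-initializes the gates).

```python
# Illustrative sketch of an AdaLN-conditioned transformer block, the kind of
# conditioning used by DiT-style diffusion policies (DP-T); names are assumed.
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Affine-free LayerNorms: the affine part comes from the condition.
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Regress shift/scale/gate for both sub-layers from the condition.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) noisy action tokens; cond: (B, dim) condition vector.
        s1, sc1, g1, s2, sc2, g2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1.unsqueeze(1)) + s1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + sc2.unsqueeze(1)) + s2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x
```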