DiTFastAttn: Attention Compression for Diffusion Transformer Models
Pu Lu
Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip redundant computations during conditional generation. We apply DiTFastAttn to DiT and PixArt-Sigma for image generation tasks, and to OpenSora for video generation tasks. Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8× end-to-end speedup at high-resolution (2k × 2k) generation.
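The abstract compresses a lot of mechanism into one paragraph, so a small, hedged sketch may help make the three sharing strategies concrete. The code below is an illustrative PyTorch toy, not the authors' implementation: the class and flag names (SharedAttention, share_step, share_cfg, window_mask) are assumptions of ours, the 1-D token window stands in for the paper's window attention over image tokens, and the decision of when to enable each flag would in practice come from a calibration procedure rather than being hard-coded.

```python
import torch
import torch.nn.functional as F

def window_mask(n_tokens: int, window: int, device=None) -> torch.Tensor:
    """Boolean mask keeping only token pairs within a local 1-D window."""
    idx = torch.arange(n_tokens, device=device)
    return (idx[None, :] - idx[:, None]).abs() <= window

class SharedAttention:
    """Toy attention module that caches outputs so later steps / CFG branches can reuse them."""
    def __init__(self, window: int = 4):
        self.window = window
        self.prev_out = None   # cached output of the previous timestep (temporal redundancy)
        self.residual = None   # full-attention minus window-attention residual (spatial redundancy)

    def forward(self, q, k, v, *, share_step: bool = False, share_cfg: bool = False):
        # (3) Attention Sharing across CFG: run attention on the conditional half of
        # the batch only and copy the result to the unconditional half.
        if share_cfg:
            half = q.shape[0] // 2
            out_cond = self.forward(q[:half], k[:half], v[:half], share_step=share_step)
            return torch.cat([out_cond, out_cond], dim=0)

        # (2) Attention Sharing across Timesteps: reuse the cached output when this
        # layer/step is judged (e.g. by calibration) to be similar to its neighbor.
        if share_step and self.prev_out is not None:
            return self.prev_out

        # (1) Window Attention with Residual Sharing: cheap local attention plus a
        # residual, computed once from full attention, that restores long-range terms.
        n = q.shape[-2]
        mask = window_mask(n, self.window, device=q.device)
        local = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        if self.residual is None:
            full = F.scaled_dot_product_attention(q, k, v)  # full cost paid once
            self.residual = full - local
        out = local + self.residual

        self.prev_out = out
        return out

# Usage: batch of 2 = [conditional, unconditional], 2 heads, 16 tokens, head dim 8.
attn = SharedAttention(window=4)
q, k, v = (torch.randn(2, 2, 16, 8) for _ in range(3))
step0 = attn.forward(q, k, v, share_cfg=True)                    # full attention paid once; CFG branch reused
step1 = attn.forward(q, k, v, share_cfg=True, share_step=True)   # entire output reused from the previous step
print(step0.shape, step1.shape)  # torch.Size([2, 2, 16, 8]) twice
```

The usage lines assume the common CFG convention of stacking the conditional and unconditional inputs along the batch dimension; in the actual method, which layers and steps use each strategy is decided per model rather than fixed as above.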
Neural Information Processing Systems
May-28-2025, 06:51:47 GMT
- Genre:
  - Research Report
  - Experimental Study (1.00)
  - New Finding (1.00)
  - Workflow (0.93)
- Industry:
  - Information Technology (0.46)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (1.00)
    - Natural Language (1.00)
    - Representation & Reasoning (0.93)
    - Vision (1.00)