
Neural Information Processing Systems 

Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization (µP) was proposed for vanilla Transformers; it enables stable HP transfer from small to large language models and dramatically reduces tuning costs. However, it remains unclear whether µP of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and training objective. In this work, we generalize standard µP to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that µP of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-α, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing µP methodologies. Leveraging this result, we systematically demonstrate that DiT-µP enjoys robust HP transferability. Notably, DiT-XL-2-µP with a transferred learning rate achieves 2.9× faster convergence than the original DiT-XL-2.
