1bf3dbbd6346f50627e2ab1795f90435-Paper-Conference.pdf

Jun-15-2026, 10:05:54 GMT–Neural Information Processing Systems

Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Maximal Update Parametrization (µP), recently proposed for vanilla Transformers, enables stable HP transfer from small to large language models and dramatically reduces tuning costs. However, it remains unclear whether µP of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and training objective. In this work, we generalize standard µP to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that µP of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-α, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing µP methodologies. Leveraging this result, we systematically demonstrate that DiT-µP enjoys robust HP transferability. Notably, DiT-XL-2-µP with a transferred learning rate achieves 2.9× faster convergence than the original DiT-XL-2.

large language model, machine learning, natural language

Country:
- Asia (0.28)

Genre:
- Research Report
  - Experimental Study (1.00)
  - New Finding (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
