Efficient Scaling of Diffusion Transformers for Text-to-Image Generation

Hao Li, Shamit Lal, Zhiheng Li, Yusheng Xie, Ying Wang, Yang Zou, Orchid Majumder, R. Manmatha, Zhuowen Tu, Stefano Ermon, Stefano Soatto, Ashwin Swaminathan

arXiv.org Artificial Intelligence 

Figure 1: Examples of high-resolution images generated by a 2.3B U-ViT 1K model.

We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B up to 8B parameters on datasets of up to 600M images. We find that U-ViT, a pure self-attention based DiT, offers a simpler design and scales more effectively than cross-attention based DiT variants, while allowing straightforward extension to additional conditions and other modalities. We identify that a 2.3B U-ViT model achieves better performance than the SDXL UNet and other DiT variants in a controlled setting. On the data scaling side, we investigate how increasing dataset size and enhanced long captions improve text-image alignment and learning efficiency.

The Transformer's (Vaswani et al., 2017) straightforward design and ability to scale efficiently have driven significant advancements in large language models (LLMs) (Kaplan et al., 2020). Its inherent simplicity and ease of parallelization make it well suited for hardware acceleration. Despite the rapid evolution of DiT models, a comprehensive comparison between various DiT architectures and UNet-based models for text-to-image (T2I) generation is still lacking. Furthermore, the optimal scaling strategy for transformer models in T2I tasks, compared to UNets, is yet to be determined. The challenge of establishing a fair comparison is further compounded by variation in training settings and the significant computational resources required to train these models.
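To make the architectural contrast concrete, the following is a minimal, illustrative PyTorch sketch (not the authors' implementation; block structure, layer names, and dimensions are assumptions) of the two conditioning styles compared here: a U-ViT-style block that concatenates text and image tokens and uses only self-attention, versus a cross-attention DiT-style block where image tokens attend to text tokens through a dedicated layer.

```python
# Hypothetical, simplified blocks for illustration only; not the paper's code.
import torch
import torch.nn as nn


class UViTBlock(nn.Module):
    """U-ViT style: conditioning enters via token concatenation + self-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img_tokens, txt_tokens):
        # Joint sequence of [text tokens | image tokens] processed by plain self-attention.
        x = torch.cat([txt_tokens, img_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        n_txt = txt_tokens.shape[1]
        return x[:, n_txt:], x[:, :n_txt]  # split back into (image, text) tokens


class CrossAttnDiTBlock(nn.Module):
    """Cross-attention DiT style: image tokens attend to text tokens in a separate layer."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img_tokens, txt_tokens):
        x = img_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.norm2(x), txt_tokens, txt_tokens)[0]  # condition on text
        x = x + self.mlp(self.norm3(x))
        return x


if __name__ == "__main__":
    img = torch.randn(2, 256, 512)  # e.g. 16x16 latent patches, width 512 (assumed)
    txt = torch.randn(2, 77, 512)   # e.g. 77 text-encoder tokens projected to width 512 (assumed)
    print(UViTBlock(512)(img, txt)[0].shape)       # torch.Size([2, 256, 512])
    print(CrossAttnDiTBlock(512)(img, txt).shape)  # torch.Size([2, 256, 512])
```

The U-ViT block needs no extra conditioning machinery: adding another modality amounts to concatenating more tokens into the joint sequence, which is one reason the paper argues this design scales and extends more cleanly than adding cross-attention layers per block.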
