Lumina-Next : Making Lumina-T2X Stronger and Faster with Next-DiT
Neural Information Processing Systems
Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers (Flag-DiT) that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges, including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X that delivers stronger generation performance with greater training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture, identify several suboptimal components, and address them by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers.
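To make the RoPE-scaling idea concrete, the sketch below implements plain 1D rotary position embedding with a global frequency-scaling knob, in the spirit of NTK-style context extrapolation. This is an illustrative assumption, not the paper's method: Lumina-Next uses 3D RoPE and a Frequency- and Time-Aware variant that modulates frequencies selectively, which is not reproduced here. The function names (`rope_frequencies`, `apply_rope`) and the uniform `scale` parameter are hypothetical.

```python
import numpy as np

def rope_frequencies(dim, theta=10000.0, scale=1.0):
    """Per-pair rotation frequencies for 1D RoPE.

    scale > 1 uniformly stretches wavelengths so positions beyond the
    training range map into familiar angle ranges (a crude stand-in for
    the selective, frequency-aware scaling proposed in the paper).
    """
    freqs = 1.0 / (theta ** (np.arange(0, dim, 2) / dim))
    return freqs / scale

def apply_rope(x, positions, freqs):
    """Rotate interleaved feature pairs of x (seq, dim) by position-dependent angles."""
    angles = np.outer(positions, freqs)          # (seq, dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

The key property preserved by any such scaling is relative encoding: the inner product of a rotated query and key depends only on their position difference, so attention patterns learned at low resolution can transfer when positions are rescaled.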
May-27-2025, 20:41:34 GMT