Majumder, Orchid
Efficient Scaling of Diffusion Transformers for Text-to-Image Generation
Li, Hao, Lal, Shamit, Li, Zhiheng, Xie, Yusheng, Wang, Ying, Zou, Yang, Majumder, Orchid, Manmatha, R., Tu, Zhuowen, Ermon, Stefano, Soatto, Stefano, Swaminathan, Ashwin
Figure 1: Examples of high-resolution images generated by a 2.3B U-ViT 1K model.

We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B up to 8B parameters on datasets of up to 600M images. We find that U-ViT, a pure self-attention-based DiT model, provides a simpler design and scales more effectively than cross-attention-based DiT variants, which allows straightforward extension to extra conditions and other modalities. We identify a 2.3B U-ViT model that achieves better performance than SDXL's UNet and other DiT variants in a controlled setting. On the data scaling side, we investigate how increasing dataset size and enhanced long captions improve text-image alignment performance and learning efficiency.

The straightforward design and efficient scaling of the Transformer (Vaswani et al., 2017) have driven significant advancements in large language models (LLMs) (Kaplan et al., 2020). Its inherent simplicity and ease of parallelization make it well suited for hardware acceleration. Despite the rapid evolution of DiT models, a comprehensive comparison between the various DiT architectures and UNet-based models for text-to-image (T2I) generation is still lacking. Furthermore, the optimal scaling strategy for transformer models in T2I tasks, compared with the UNet, is yet to be determined. Establishing a fair comparison is further complicated by the variation in training settings and the significant computational resources required to train these models.
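To make the architectural distinction concrete, here is a minimal PyTorch-style sketch of the two conditioning mechanisms the abstract contrasts. The class names, dimensions, and token counts below are illustrative assumptions, not the paper's implementation: a U-ViT-style block concatenates conditioning tokens with image tokens and uses plain self-attention, while a cross-attention-style block keeps the conditioning tokens in a separate stream.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """U-ViT-style block (sketch): text/time tokens are concatenated with image
    patch tokens, so one self-attention layer handles conditioning and content."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, cond_tokens):
        # Single sequence: [conditioning tokens ; image patch tokens].
        x = torch.cat([cond_tokens, img_tokens], dim=1)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        x = x + out
        # Return only the image-token part for the next block.
        return x[:, cond_tokens.shape[1]:]

class CrossAttentionBlock(nn.Module):
    """Cross-attention DiT-style block (sketch): image tokens attend to a
    separate stream of conditioning tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, cond_tokens):
        q = self.norm_q(img_tokens)
        kv = self.norm_kv(cond_tokens)
        out, _ = self.attn(q, kv, kv)
        return img_tokens + out

# Toy usage: 256 image patches, 77 text tokens, width 512 (assumed sizes).
img = torch.randn(2, 256, 512)
txt = torch.randn(2, 77, 512)
print(SelfAttentionBlock(512)(img, txt).shape)   # torch.Size([2, 256, 512])
print(CrossAttentionBlock(512)(img, txt).shape)  # torch.Size([2, 256, 512])
```

In the concatenation design, any extra condition or modality can simply be appended as more tokens, which is the "straightforward extension" the abstract refers to.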
On the Scalability of Diffusion-based Text-to-Image Generation
Li, Hao, Zou, Yang, Wang, Ying, Majumder, Orchid, Xie, Yusheng, Manmatha, R., Swaminathan, Ashwin, Tu, Zhuowen, Ermon, Stefano, Soatto, Stefano
Scaling up model and data size has been quite successful for the evolution of LLMs, but the scaling laws for diffusion-based text-to-image (T2I) models are not fully explored. It is also unclear how to efficiently scale such models for better performance at reduced cost, and the varied training settings and expensive training runs make a fair model comparison extremely difficult. In this work, we empirically study the scaling properties of diffusion-based T2I models by performing extensive and rigorous ablations on scaling both the denoising backbone and the training set, including training scaled UNet and Transformer variants ranging from 0.4B to 4B parameters on datasets of up to 600M images. For model scaling, we find that the location and amount of cross-attention distinguishes the performance of existing UNet designs, and that increasing the number of transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel counts. We then identify an efficient UNet variant that is 45% smaller and 28% faster than SDXL's UNet. On the data scaling side, we show that the quality and diversity of the training set matter more than dataset size alone: increasing caption density and diversity improves text-image alignment performance and learning efficiency. Finally, we provide scaling functions that predict text-image alignment performance as functions of model size, compute, and dataset size.
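The scaling functions mentioned at the end of the abstract are, in spirit, saturating power-law fits of an alignment metric against model size, compute, or data. Below is a minimal sketch of how such a fit could be performed; the functional form, synthetic numbers, and parameter values are assumptions for illustration, not the paper's reported fits.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (model size, alignment score) pairs; not the paper's data.
params_b = np.array([0.4, 0.9, 1.8, 2.6, 4.0])       # parameters, in billions
score    = np.array([0.70, 0.74, 0.77, 0.785, 0.80])  # text-image alignment metric

def saturating_power_law(n, a, b, c):
    # score(N) = c - a * N^(-b): improves with model size and saturates at c.
    return c - a * np.power(n, -b)

(a, b, c), _ = curve_fit(saturating_power_law, params_b, score, p0=[0.1, 0.5, 0.85])
print(f"fit: score(N) = {c:.3f} - {a:.3f} * N^(-{b:.3f})")
print("extrapolated score at 8B params:", saturating_power_law(8.0, a, b, c))
```

The same recipe applies with training compute or dataset size on the x-axis; the fitted curve is then used to extrapolate performance at scales that were not trained.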
Estimating informativeness of samples with Smooth Unique Information
Harutyunyan, Hrayr, Achille, Alessandro, Paolini, Giovanni, Majumder, Orchid, Ravichandran, Avinash, Bhotika, Rahul, Soatto, Stefano
We define a notion of information that an individual sample provides to the training of a neural network, and we specialize it to measure both how much a sample informs the final weights and how much it informs the function computed by the weights. Though related, we show that these quantities have a qualitatively different behavior. We give efficient approximations of these quantities using a linearized network and demonstrate empirically that the approximation is accurate for real-world architectures, such as pre-trained ResNets. We apply these measures to several problems, such as dataset summarization, analysis of under-sampled classes, comparison of informativeness of different data sources, and detection of adversarial and corrupted examples. Our work generalizes existing frameworks but enjoys better computational properties for heavily overparametrized models, which makes it possible to apply it to real-world networks.

Training a deep neural network (DNN) entails extracting information from samples in a dataset and storing it in the weights of the network, so that it may be used in future inference or prediction. But how much information does a particular sample contribute to the trained model? The answer can be used to provide strong generalization bounds (if no information is used, the network is not memorizing the sample), privacy bounds (how much information the network can leak about a particular sample), and enable better interpretation of the training process and its outcome. To determine the information content of samples, we need to define and compute information. In the classical sense, information is a property of random variables, which may be degenerate for the deterministic process of computing the output of a trained DNN in response to a given input (inference). So, even posing the problem presents some technical challenges.
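The linearized-network approximation lends itself to a small closed-form illustration. The sketch below measures how much the function computed by a linear(ized) model changes when one training sample is left out, using ridge regression on stand-in features; the features, dimensions, and regularizer are made up, and this is only a rough proxy for the general idea, not the paper's smooth unique information estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in features: think of each row as the Jacobian features of a
# linearized network at one training sample (hypothetical setup).
n, d = 200, 50
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 1e-2  # ridge regularizer, playing the role of a smoothing term

def ridge_weights(Phi, y, lam):
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

w_full = ridge_weights(Phi, y, lam)

def function_change_without(i, Phi_test):
    """How much the linearized model's test predictions move when sample i
    is removed from training: a proxy for that sample's informativeness."""
    mask = np.arange(n) != i
    w_loo = ridge_weights(Phi[mask], y[mask], lam)
    return np.linalg.norm(Phi_test @ (w_full - w_loo))

Phi_test = rng.normal(size=(20, d))
print([round(function_change_without(i, Phi_test), 4) for i in range(5)])
```

Samples whose removal barely moves the predictions carry little unique information about the learned function; samples with large scores are the ones the model would memorize or rely on most.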
Scheduling the Learning Rate via Hypergradients: New Insights and a New Algorithm
Donini, Michele, Franceschi, Luca, Pontil, Massimiliano, Majumder, Orchid, Frasconi, Paolo
We study the problem of fitting task-specific learning rate schedules from the perspective of hyperparameter optimization (HPO). This allows us to explicitly search for schedules that achieve good generalization. We describe the structure of the gradient of the validation error w.r.t. the learning rate (the hypergradient) and, based on this, introduce a novel online algorithm.

Research in this direction is vast (see Hutter et al. (2019) for an overview) and includes model-based (Snoek et al., 2012; Hutter et al., 2015), model-free (Bergstra & Bengio, 2012; Hansen, 2016), and gradient-based (Domke, 2012; Maclaurin et al., 2015) approaches. Problem (1) can, in principle, be solved by any HPO technique.
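As a concrete illustration of a hypergradient w.r.t. the learning rate, the sketch below adapts the learning rate online on a toy quadratic using the one-step identity d E_val(w - eta*g)/d eta = -grad_val(w - eta*g) . g. The objectives, the hyper-learning-rate beta, and the update rule are assumptions chosen for illustration; this is not the algorithm proposed in the paper.

```python
import numpy as np

# Toy quadratic objectives standing in for training and validation losses.
A_tr = np.diag([1.0, 10.0])
A_val = np.diag([1.2, 9.0])

def grad_tr(w):
    return A_tr @ w

def grad_val(w):
    return A_val @ w

w = np.array([1.0, 1.0])
eta, beta = 0.01, 1e-4  # learning rate and hyper-learning-rate (assumed values)

for t in range(100):
    g = grad_tr(w)
    w_new = w - eta * g
    # One-step hypergradient: d E_val(w - eta*g) / d eta = -grad_val(w_new) . g
    hypergrad = -grad_val(w_new) @ g
    # Gradient step on the learning rate itself, kept non-negative.
    eta = max(eta - beta * hypergrad, 0.0)
    w = w_new

print("final eta:", eta, "final val loss:", 0.5 * w @ A_val @ w)
```

When successive training and validation gradients are aligned, the hypergradient is negative and the learning rate grows; when they conflict, it shrinks, which is the basic signal an online hypergradient scheduler exploits.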