Scaling Laws for Generative Mixed-Modal Language Models

Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke Zettlemoyer

arXiv.org Artificial Intelligence 

Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion parameters, trained on 5 to 100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. We also report four empirical phenomena observed during training, including emergent coordinate-ascent-style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models with unique distributional properties.

Generative language models have been developed for a wide range of data modalities, including natural language text (Brown et al., 2020), code (Chen et al., 2021; Fried et al., 2022), images (Ramesh et al., 2021; Yasunaga et al., 2022), and molecules or proteins (Chilingaryan et al., 2022; Hsu et al., 2022). Recent work has also introduced unified models (Aghajanyan et al., 2022; Reed et al., 2022; Wang et al., 2022; Zellers et al., 2022) that can simultaneously model multiple modalities. One advantage of generative modeling in these cases is that the models scale well in practice: adding data, compute, or parameters typically improves model quality. These scaling trends have been carefully studied for uni-modal models (Kaplan et al., 2020; Hoffmann et al., 2022), and some recent work focuses on pairs of modalities (Droppo & Elibol, 2021; Henighan et al., 2020).
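To make the kind of parametric fit involved concrete, the sketch below fits a Chinchilla-style loss surface, L(N, D) = E + A/N^alpha + B/D^beta, the uni-modal form from Hoffmann et al. (2022) cited above, to a handful of (model size, token count, loss) observations. The observations, the resulting coefficients, and the use of scipy.optimize.curve_fit are illustrative assumptions for exposition only; they are not the paper's mixed-modal law, data, or fitted constants.

    # Minimal sketch: fitting a Chinchilla-style scaling law
    #   L(N, D) = E + A / N**alpha + B / D**beta
    # to hypothetical (parameters N, tokens D, validation loss) points.
    import numpy as np
    from scipy.optimize import curve_fit

    def scaling_law(X, E, A, B, alpha, beta):
        """Predicted loss for model size N (parameters) and data size D (tokens)."""
        N, D = X
        return E + A / N**alpha + B / D**beta

    # Hypothetical observations (placeholder values, not measurements from the paper).
    N = np.array([8e6, 8e6, 1e9, 1e9, 30e9, 30e9])
    D = np.array([5e9, 5e10, 5e9, 5e10, 5e10, 1e11])
    loss = np.array([4.99, 4.06, 4.01, 3.09, 2.92, 2.74])

    # Fit the five coefficients; bounds keep the constants and exponents positive.
    popt, _ = curve_fit(
        scaling_law, (N, D), loss,
        p0=[2.0, 100.0, 1000.0, 0.3, 0.3],
        bounds=(0.0, [10.0, 1e6, 1e9, 1.0, 1.0]),
    )
    E, A, B, alpha, beta = popt
    print(f"E={E:.3f}  A={A:.3g}  B={B:.3g}  alpha={alpha:.3f}  beta={beta:.3f}")

    # Extrapolate to a hypothetical 30B-parameter model trained on 100B tokens.
    print("predicted loss:", scaling_law((30e9, 1e11), *popt))

In the paper's mixed-modal setting, additional terms capture the interactions between modality pairs on top of the per-modality contributions; the same least-squares fitting procedure carries over in spirit.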
