MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
McKinzie, Brandon, Gan, Zhe, Fauconnier, Jean-Philippe, Dodge, Sam, Zhang, Bowen, Dufter, Philipp, Shah, Dhruti, Du, Xianzhi, Peng, Futang, Weers, Floris, Belyi, Anton, Zhang, Haotian, Singh, Karanjeet, Kang, Doug, Jain, Ankur, Hè, Hongyu, Schwarzer, Max, Gunter, Tom, Kong, Xiang, Zhang, Aonan, Wang, Jianyu, Wang, Chong, Du, Nan, Lei, Tao, Wiseman, Sam, Yin, Guoli, Lee, Mark, Wang, Zirui, Pang, Ruoming, Grasch, Peter, Toshev, Alexander, Yang, Yinfei
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identify several crucial design lessons. For example, we demonstrate that, for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published multimodal pre-training results. Further, we show that the image encoder, together with image resolution and the image token count, has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models, including both dense variants up to 30B and mixture-of-experts (MoE) variants up to 64B, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
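To make the data-mixture point concrete, the sketch below shows one simple way to sample pre-training examples from the three data types the abstract names (image-caption, interleaved image-text, text-only). The mixture weights and source names here are illustrative assumptions, not MM1's published recipe.

```python
import random

# Illustrative mixture weights over the three pre-training data types.
# These numbers are placeholders for the kind of ratio the paper ablates,
# not MM1's actual values.
MIXTURE_WEIGHTS = {
    "image_caption": 0.45,
    "interleaved_image_text": 0.45,
    "text_only": 0.10,
}

def sample_source(rng: random.Random) -> str:
    """Pick which dataset the next pre-training example is drawn from."""
    sources, weights = zip(*MIXTURE_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_source(rng) for _ in range(10_000)]
    for source in MIXTURE_WEIGHTS:
        print(f"{source}: {draws.count(source) / len(draws):.2%}")
```

In practice such weights are tuned via ablations; the key takeaway from the paper is that dropping any one of the three data types hurts either few-shot or zero-shot performance.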
Brief Review -- Scaling Language Models: Methods, Analysis & Insights from Training Gopher
RMSNorm (Zhang and Sennrich, 2019) is used instead of LayerNorm, and the relative positional encoding scheme from Dai et al. (2019) is used rather than absolute positional encodings. Relative encodings permit evaluation on sequences longer than those seen during training, which improves the modelling of articles and books.
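For reference, a minimal RMSNorm along the lines of Zhang and Sennrich (2019) might look like the PyTorch sketch below. This is not Gopher's actual implementation; the `eps` value and the placement of the learned gain are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (Zhang and Sennrich, 2019).

    Unlike LayerNorm, it rescales by the RMS of the activations only:
    no mean subtraction and no bias term.
    """

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps  # assumed value, added for numerical stability
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1 / RMS over the last (feature) dimension
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

# Usage: normalize the last dimension of a batch of hidden states.
# h = torch.randn(2, 16, 512)
# print(RMSNorm(512)(h).shape)  # torch.Size([2, 16, 512])
```

Dropping the mean-centering and bias of LayerNorm makes the operation slightly cheaper while behaving comparably in practice, which is why several large language models adopt it.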