Parmar, Jupinder
Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models
Parmar, Jupinder, Satheesh, Sanjev, Patwary, Mostofa, Shoeybi, Mohammad, Catanzaro, Bryan
Language modeling abilities have seen massive improvements over the past few years (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2024; Team, 2024). While these advancements have enabled language models (LMs) to become highly skilled conversational agents (OpenAI, 2024; Anthropic, 2024; Team, 2024), they have come with increased computational cost as pretraining has become ever more expensive due to both the number of model parameters (Team et al., 2024; DeepSeek-AI et al., 2024) and pretraining dataset size (Touvron et al., 2023; Gemma Team, 2024; Parmar et al., 2024) continuing to grow in scale. With new LMs that set state of the art accuracy being released on a frequent basis, LMs developed only a couple months back are becoming obsolete as their capabilities are no longer up to par. This leaves […]
In our experiments, we start on top of a 15B parameter LM that has seen 8T tokens of pretraining data (Parmar et al., 2024). Experimenting with a well trained model of this scale ensures that our findings will be transferable to most settings and model sizes. We first identify the type of data distribution that should be used during continued pretraining and find that it is optimal to have two distributions, with the final one more heavily weighting data sources that relate to the abilities we want to improve in the model. Second, we determine what learning rate schedules enable the most efficient learning during continued pretraining and determine that the most performant one strikes a balance between magnitude of learning rate and steepness of decay. Lastly, we show how the learning rate value at which we switch between data distributions affects downstream accuracy and identify the point at which this switch should be made.
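As a concrete illustration of the recipe sketched above, the following is a minimal Python sketch of how a continued-pretraining run might pair a decaying learning rate schedule with a switch between two data blends once the learning rate falls below a chosen value. All hyperparameters, blend names, and the cosine schedule itself are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of the two-phase continued-pretraining setup described above:
# a decaying learning rate and a switch from a general data blend to a blend that
# upweights the target abilities once the LR drops below a chosen threshold.
# All values are illustrative, not the paper's actual hyperparameters.
import math

PEAK_LR = 3e-5          # starting LR for continued pretraining (assumed)
MIN_LR = 3e-7           # final LR (assumed)
TOTAL_STEPS = 10_000    # length of the continued-pretraining run (assumed)
SWITCH_LR = 1e-5        # LR value at which the data distribution is switched (assumed)

GENERAL_BLEND = {"web": 0.7, "code": 0.15, "academic": 0.15}             # phase-1 blend (illustrative)
TARGETED_BLEND = {"web": 0.3, "code": 0.2, "academic": 0.2, "qa": 0.3}   # phase-2 blend (illustrative)

def learning_rate(step: int) -> float:
    """Cosine decay from PEAK_LR to MIN_LR over TOTAL_STEPS."""
    progress = min(step / TOTAL_STEPS, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

def data_blend(step: int) -> dict:
    """Use the general blend early on, then the targeted blend once the LR decays past SWITCH_LR."""
    return GENERAL_BLEND if learning_rate(step) > SWITCH_LR else TARGETED_BLEND

if __name__ == "__main__":
    for step in (0, 2_500, 5_000, 7_500, 10_000):
        print(step, f"lr={learning_rate(step):.2e}", data_blend(step))
```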
Data, Data Everywhere: A Guide for Pretraining Dataset Construction
Parmar, Jupinder, Prabhumoye, Shrimai, Jennings, Joseph, Liu, Bo, Jhunjhunwala, Aastha, Wang, Zhilin, Patwary, Mostofa, Shoeybi, Mohammad, Catanzaro, Bryan
The impressive capabilities of recent language models can be largely attributed to the multi-trillion token pretraining datasets that they are trained on. However, model developers fail to disclose their construction methodology, which has led to a lack of open information on how to develop effective pretraining sets. To address this issue, we perform the first systematic study across the entire pipeline of pretraining set construction. First, we run ablations on existing techniques for pretraining set development to identify which methods translate to the largest gains in model accuracy on downstream evaluations. Then, we categorize the most widely used data source, web crawl snapshots, across the attributes of toxicity, quality, type of speech, and domain. Finally, we show how such attribute information can be used to further refine and improve the quality of a pretraining set. These findings constitute an actionable set of steps that practitioners can use to develop high quality pretraining sets.
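To make the attribute-based refinement step more concrete, here is a minimal, hypothetical sketch of tagging web-crawl documents with toxicity, quality, type-of-speech, and domain attributes and then filtering on them. The placeholder heuristics and thresholds are assumptions for illustration, not the paper's actual classifiers or pipeline.

```python
# Hypothetical sketch: annotate each web-crawl document with attribute labels,
# then use those attributes to filter (or reweight) the pretraining set.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    toxicity: float = 0.0   # 0 (benign) .. 1 (toxic), would come from a toxicity classifier
    quality: float = 0.0    # 0 (low) .. 1 (high), would come from a quality classifier
    speech_type: str = ""   # e.g. "news", "dialogue", "how-to"
    domain: str = ""        # e.g. "finance", "science", "sports"

def annotate(doc: Document) -> Document:
    # Placeholder heuristics standing in for learned attribute classifiers.
    doc.toxicity = 0.0
    doc.quality = min(len(doc.text) / 2000, 1.0)  # crude proxy: longer documents score higher
    doc.speech_type = "how-to" if "step" in doc.text.lower() else "news"
    doc.domain = "science" if "experiment" in doc.text.lower() else "general"
    return doc

def refine(docs, max_toxicity=0.2, min_quality=0.5):
    """Keep only documents that pass the toxicity and quality thresholds."""
    return [d for d in map(annotate, docs) if d.toxicity <= max_toxicity and d.quality >= min_quality]
```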
Nemotron-4 340B Technical Report
Nvidia, Adler, Bo, Agarwal, Niket, Aithal, Ashwath, Anh, Dong H., Bhattacharya, Pallab, Brundyn, Annika, Casper, Jared, Catanzaro, Bryan, Clay, Sharon, Cohen, Jonathan, Das, Sirshak, Dattagupta, Ayush, Delalleau, Olivier, Derczynski, Leon, Dong, Yi, Egert, Daniel, Evans, Ellie, Ficek, Aleksander, Fridman, Denys, Ghosh, Shaona, Ginsburg, Boris, Gitman, Igor, Grzegorzek, Tomasz, Hero, Robert, Huang, Jining, Jawa, Vibhu, Jennings, Joseph, Jhunjhunwala, Aastha, Kamalu, John, Khan, Sadaf, Kuchaiev, Oleksii, LeGresley, Patrick, Li, Hui, Liu, Jiwei, Liu, Zihan, Long, Eileen, Mahabaleshwarkar, Ameya Sunil, Majumdar, Somshubra, Maki, James, Martinez, Miguel, de Melo, Maer Rodrigues, Moshkov, Ivan, Narayanan, Deepak, Narenthiran, Sean, Navarro, Jesus, Nguyen, Phong, Nitski, Osvald, Noroozi, Vahid, Nutheti, Guruprasad, Parisien, Christopher, Parmar, Jupinder, Patwary, Mostofa, Pawelec, Krzysztof, Ping, Wei, Prabhumoye, Shrimai, Roy, Rajarshi, Saar, Trisha, Sabavat, Vasanth Rao Naik, Satheesh, Sanjeev, Scowcroft, Jane Polak, Sewall, Jason, Shamis, Pavel, Shen, Gerald, Shoeybi, Mohammad, Sizer, Dave, Smelyanskiy, Misha, Soares, Felipe, Sreedhar, Makesh Narsimhan, Su, Dan, Subramanian, Sandeep, Sun, Shengyang, Toshniwal, Shubham, Wang, Hao, Wang, Zhilin, You, Jiaxuan, Zeng, Jiaqi, Zhang, Jimmy, Zhang, Jing, Zhang, Vivienne, Zhang, Yian, Zhu, Chen
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and their outputs. These models perform competitively with open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of the data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.
Nemotron-4 15B Technical Report
Parmar, Jupinder, Prabhumoye, Shrimai, Jennings, Joseph, Patwary, Mostofa, Subramanian, Sandeep, Su, Dan, Zhu, Chen, Narayanan, Deepak, Jhunjhunwala, Aastha, Dattagupta, Ayush, Jawa, Vibhu, Liu, Jiwei, Mahabaleshwarkar, Ameya, Nitski, Osvald, Brundyn, Annika, Maki, James, Martinez, Miguel, You, Jiaxuan, Kamalu, John, LeGresley, Patrick, Fridman, Denys, Casper, Jared, Aithal, Ashwath, Kuchaiev, Oleksii, Shoeybi, Mohammad, Cohen, Jonathan, Catanzaro, Bryan
For example, Hoffmann et al. (2022) show that given two roughly IsoFLOP GPT models with a similar data distribution, a 65-billion-parameter model trained on 1.4 trillion tokens and a 280-billion-parameter model trained on 300 billion tokens, the 65B model achieves better accuracy on downstream tasks. This trade-off of allocating compute towards training on more data as opposed to increasing model size is particularly appealing from an inference perspective, reducing latency and the amount of compute needed to serve models. As a consequence, a major focus of language modeling training efforts has shifted to collecting high-quality multi-trillion token datasets from public sources such as Common Crawl.
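As a rough sanity check on why these two configurations count as roughly IsoFLOP, the common approximation of about 6 x parameters x tokens training FLOPs (an assumption here, not a figure from the report) gives similar compute budgets for both:

```python
# Back-of-the-envelope check using the common ~6 * parameters * tokens FLOPs
# approximation (an assumption, not a figure from the report) that the two
# Hoffmann et al. (2022) configurations are roughly compute-matched.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

small = train_flops(65e9, 1.4e12)    # 65B parameters, 1.4T tokens
large = train_flops(280e9, 300e9)    # 280B parameters, 300B tokens

print(f"65B / 1.4T tokens : {small:.2e} FLOPs")   # ~5.5e23
print(f"280B / 300B tokens: {large:.2e} FLOPs")   # ~5.0e23
```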