Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models
Jupinder Parmar, Sanjeev Satheesh, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
arXiv.org Artificial Intelligence
In our experiments, we start on top of a 15B parameter LM that has seen 8T tokens of pretraining data (Parmar et al., 2024). Experimenting with a well-trained model of this scale ensures that our findings will transfer to most settings and model sizes. We first identify the type of data distribution that should be used during continued pretraining and find that it is optimal to use two distributions, with the final one more heavily weighting data sources related to the abilities we want to improve in the model. Second, we determine which learning rate schedules enable the most efficient learning during continued pretraining and find that the most performant one strikes a balance between the magnitude of the learning rate and the steepness of its decay. Lastly, we show how the learning rate value at which we switch between data distributions affects downstream accuracy and identify the point at which this switch should be made.

Language modeling abilities have seen massive improvements over the past few years (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2024; Team, 2024). While these advancements have enabled language models (LMs) to become highly skilled conversational agents (OpenAI, 2024; Anthropic, 2024; Team, 2024), they have come with increased computational cost, as pretraining has become ever more expensive with both the number of model parameters (Team et al., 2024; DeepSeek-AI et al., 2024) and the pretraining dataset size (Touvron et al., 2023; Gemma Team, 2024; Parmar et al., 2024) continuing to grow in scale. With new LMs that set state-of-the-art accuracy being released on a frequent basis, LMs developed only a couple of months earlier are becoming obsolete as their capabilities are no longer up to par. This leaves ...
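As a rough illustration of the recipe described above, the following Python sketch couples a cosine learning-rate decay with a two-phase data blend whose switch is keyed to the learning-rate value. It is a minimal sketch under assumed settings: the constants (PRETRAIN_PEAK_LR, CT_MAX_LR, TOTAL_CT_STEPS, SWITCH_LR) and the source weights are hypothetical placeholders, not the values or blends reported in the paper.

```python
import math

# All constants below are illustrative assumptions, not values reported in the paper.
PRETRAIN_PEAK_LR = 3e-4            # peak LR of the original pretraining run (assumed)
CT_MAX_LR = PRETRAIN_PEAK_LR / 10  # continued pretraining restarts from a lower LR (assumed ratio)
CT_MIN_LR = CT_MAX_LR / 100        # LR floor at the end of decay (assumed ratio)
TOTAL_CT_STEPS = 100_000           # length of the continued-pretraining run (assumed)
SWITCH_LR = CT_MAX_LR / 5          # LR value at which the data blend switches (assumed)


def continued_pretraining_lr(step: int) -> float:
    """Cosine decay from CT_MAX_LR down to CT_MIN_LR over the run."""
    progress = min(step / TOTAL_CT_STEPS, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return CT_MIN_LR + (CT_MAX_LR - CT_MIN_LR) * cosine


def data_blend(step: int) -> dict:
    """Sampling weights over data sources at the current step.

    The first blend stays close to the pretraining distribution; once the
    learning rate decays below SWITCH_LR, the second blend up-weights the
    sources tied to the abilities being targeted.
    """
    if continued_pretraining_lr(step) > SWITCH_LR:
        return {"web": 0.7, "code": 0.2, "task_data": 0.1}  # general blend (illustrative weights)
    return {"web": 0.4, "code": 0.2, "task_data": 0.4}      # targeted blend (illustrative weights)


if __name__ == "__main__":
    for step in (0, 50_000, 75_000, 100_000):
        print(f"step {step:>7}: lr={continued_pretraining_lr(step):.2e}, blend={data_blend(step)}")
```

The design point worth noting is that the blend switch here is triggered by the learning-rate value rather than by a fixed step count, since that is the quantity the abstract identifies as affecting downstream accuracy.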
Jul-9-2024