Simple and Scalable Strategies to Continually Pre-train Large Language Models

Ibrahim, Adam; Thérien, Benjamin; Gupta, Kshitij; Richter, Mats L.; Anthony, Quentin; Lesort, Timothée; Belilovsky, Eugene; Rish, Irina

Mar-26-2024–arXiv.org Artificial Intelligence

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
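The three ingredients named in the abstract (LR re-warming, LR re-decaying, and replay of previous data) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the linear warmup shape, and every numeric value (step counts, learning rates, the replay fraction) are hypothetical and chosen only for the example.

```python
import math
import random


def rewarmed_cosine_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """Learning rate for one continual pre-training phase (illustrative).

    Re-warming: linear increase from min_lr to max_lr over warmup_steps.
    Re-decaying: cosine decay from max_lr back to min_lr by total_steps.
    """
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr past the token budget
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def sample_batch_source(replay_fraction, rng):
    """Pick which dataset the next batch comes from.

    Returns "old" (replay of previous data) with probability
    replay_fraction, otherwise "new".
    """
    return "old" if rng.random() < replay_fraction else "new"


# Hypothetical values, for illustration only.
rng = random.Random(0)
total_steps, warmup_steps = 1000, 10
max_lr, min_lr = 3e-4, 3e-5
replay_fraction = 0.05

for step in range(total_steps):
    lr = rewarmed_cosine_lr(step, total_steps, warmup_steps, max_lr, min_lr)
    source = sample_batch_source(replay_fraction, rng)
    # A real training loop would set the optimizer's LR to `lr` and draw
    # the batch from the dataset named by `source` here.
```

In this sketch the schedule is restarted at step 0 whenever a new dataset arrives, so the model is first re-warmed to `max_lr` and then re-decayed, while a small share of batches is drawn from the earlier data to limit forgetting.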

large language model, machine learning, natural language

Country:
- Europe (1.00)
- North America
  - Canada > Quebec
    - Montreal (0.14)
  - United States (1.00)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Education (1.00)
- Energy (0.92)
- Government > Regional Government
  - North America Government > United States Government (0.45)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language > Large Language Model (1.00)
