To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis

Jan-19-2025, 20:21:31 GMT - Neural Information Processing Systems

Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is likely to be approaching its scaling limit for LLMs. To further enhance LLMs, a straightforward approach is to repeat the pre-training data for additional epochs. In this study, we empirically investigate three key aspects under this approach. First, we explore the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting, leading to multi-epoch degradation.

multi-epoch degradation, scaling llm, token-crisis

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language > Large Language Model (1.00)