To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
Neural Information Processing Systems
Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is likely to be approaching its scaling limit for LLMs. To further enhance LLMs, a straightforward approach is to repeat the pre-training data for additional epochs. In this study, we empirically investigate three key aspects under this approach. First, we explore the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting, leading to multi-epoch degradation.
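The "repeat for additional epochs" setup described in the abstract amounts to streaming the same fixed corpus multiple times under a larger token budget. A minimal sketch of that data stream, assuming a toy tokenized corpus (`repeated_epochs` and the token names are illustrative, not the paper's code):

```python
def repeated_epochs(dataset, num_epochs):
    """Yield the same pre-training corpus for `num_epochs` passes.

    Hypothetical helper: with a token budget B and a corpus of D tokens,
    training for E ~= B / D epochs means each token is seen E times,
    which is the regime where the paper observes multi-epoch degradation.
    """
    for epoch in range(num_epochs):
        for sample in dataset:
            yield epoch, sample

corpus = ["tok_a", "tok_b", "tok_c"]  # stand-in for a tokenized corpus
stream = list(repeated_epochs(corpus, num_epochs=4))
print(len(stream))  # 3 tokens x 4 epochs = 12 samples seen
```

In practice the same effect is achieved by cycling a data loader past one epoch; the point is that the unique-token count stays fixed while the optimizer steps grow, which is what drives the overfitting the study measures.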
Jan-19-2025, 20:21:31 GMT