Influence-driven Curriculum Learning for Pre-training on Limited Data

Schoenegger, Loris, Thoma, Lukas, Blevins, Terra, Roth, Benjamin

Sep-29-2025–arXiv.org Artificial Intelligence

Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score that estimates the effect of individual training examples on the model's output. Models trained on our curricula outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
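
The sorting step described in the abstract can be illustrated with a minimal sketch. It assumes the influence scores have already been computed by some attribution method and are supplied as one number per example; the abstract specifies neither how the scores are obtained nor in which direction the examples are sorted, so the function name `curriculum_order`, the `descending` flag, and the example values are all hypothetical.

```python
from typing import List, Sequence


def curriculum_order(
    examples: Sequence[str],
    influence: Sequence[float],
    descending: bool = False,
) -> List[str]:
    """Return the examples reordered by a precomputed influence score.

    Assumption: `influence[i]` is the score of `examples[i]`. The sort
    direction is left to the caller because the abstract does not fix it.
    """
    if len(examples) != len(influence):
        raise ValueError("examples and influence must have the same length")
    # Sort indices rather than the examples themselves so that the scores
    # and the documents stay aligned; sorted() is stable, so ties keep
    # their original relative order.
    order = sorted(
        range(len(examples)),
        key=lambda i: influence[i],
        reverse=descending,
    )
    return [examples[i] for i in order]


# Made-up documents and scores, for illustration only.
docs = ["doc_a", "doc_b", "doc_c"]
scores = [0.7, -0.1, 0.3]

low_to_high = curriculum_order(docs, scores)
high_to_low = curriculum_order(docs, scores, descending=True)
```

A random-order baseline, as mentioned in the abstract, would correspond to shuffling `docs` instead of sorting them.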

large language model, machine learning, natural language


Country:
- Europe (1.00)
- North America > United States
  - Minnesota (0.28)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Education (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.70)
  - Machine Learning
    - Inductive Learning (1.00)
    - Neural Networks > Deep Learning (0.49)
