MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Feb-18-2026, 00:42:18 GMT–Neural Information Processing Systems

Experiments on pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection across extensive downstream tasks. It doubles the gains achieved by the state-of-the-art data selection approach that leverages larger reference models, and it halves the total FLOPs required to reach certain performance levels. Further analyses validate the effectiveness of the locally probed oracle data influence and of its approximation with data influence models. Our code is open-sourced at https://github.com/cxcscmu/MA

large language model, machine learning, oracle data

Country:
- Europe (0.04)
- North America > United States
  - Michigan (0.04)
  - Pennsylvania > Allegheny County
    - Pittsburgh (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Education (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (0.70)
  - Machine Learning > Neural Networks (0.67)
