AITopics | dclm-baseline

Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

Neural Information Processing SystemsJun-15-2026, 21:09:28 GMT

Data filtering has become a powerful tool for improving model performance while reducing computational cost. However, as large language model compute budgets continue to grow, the limited data volume provided by heavily filtered and deduplicated datasets will become a practical constraint. In efforts to better understand how to proceed, we study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude. While this finding relies on repeating the dataset for many epochs, we also investigate repeats within these datasets at the document level. We find that not all documents within a dataset are equal, and we can create better datasets relative to a token budget by explicitly manipulating the counts of individual documents. We conclude by arguing that even as large language models scale, data filtering remains an important direction of research.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

DataComp-LM: Insearchofthenextgenerationof trainingsetsforlanguagemodels

Neural Information Processing SystemsFeb-8-2026, 17:45:09 GMT

Asabaseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set.

large language model, machine learning, urlhttp, (22 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
North America > United States > Texas > Travis County > Austin (0.14)
(27 more...)

Genre: Research Report (1.00)

Industry:

Government (1.00)
Education (1.00)
Law (0.92)
(3 more...)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(5 more...)

Add feedback

PCMind-2.1-Kaiyuan-2B Technical Report

Luo, Kairong, Sun, Zhenbo, Shi, Xinyu, Chen, Shengqi, Yu, Bowen, Chen, Yunyi, Dang, Chenyi, Tao, Hengtao, Wang, Hui, Liu, Fangming, Lyu, Kaifeng, Chen, Wenguang

arXiv.org Artificial IntelligenceDec-9-2025

The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2512.07612

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

Nguyen, Thao, Li, Yang, Golovneva, Olga, Zettlemoyer, Luke, Oh, Sewoong, Schmidt, Ludwig, Li, Xian

arXiv.org Artificial IntelligenceSep-16-2025

Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has been relying on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art. To address the "data wall" of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they could become useful for training. This in turn allows us to increase the representation of synthetic data in the final pre-training set. Experiments at 1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw texts and our rewritten texts lead to 1.0, 1.3 and 2.5 percentage points improvement respectively across 22 diverse tasks, compared to training on only filtered web data. Training on the raw-synthetic data mix is also more effective than having access to 2x web data. Through further analysis, we demonstrate that about 82% of the mixed in texts come from transforming lower-quality documents that would otherwise be discarded. REWIRE also outperforms related approaches of generating synthetic data, including Wikipedia-style paraphrasing, question-answer synthesizing and knowledge extraction. These results suggest that recycling web texts holds the potential for being a simple and effective approach for scaling pre-training data. We make our high-quality synthetic data publicly available at https://huggingface.co/datasets/facebook/recycling_the_web.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2506.04689

Genre:

Research Report > New Finding (0.66)
Research Report > Experimental Study (0.46)

Industry: Health & Medicine > Surgery (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)

Add feedback

Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

Fang, Alex, Pouransari, Hadi, Jordan, Matt, Toshev, Alexander, Shankar, Vaishaal, Schmidt, Ludwig, Gunter, Tom

arXiv.org Artificial IntelligenceMar-10-2025

Data filtering has become a powerful tool for improving model performance while reducing computational cost. However, as large language model compute budgets continue to grow, the limited data volume provided by heavily filtered and deduplicated datasets will become a practical constraint. In efforts to better understand how to proceed, we study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude. While this finding relies on repeating the dataset for many epochs, we also investigate repeats within these datasets at the document level. We find that not all documents within a dataset are equal, and we can create better datasets relative to a token budget by explicitly manipulating the counts of individual documents. We conclude by arguing that even as large language models scale, data filtering remains an important direction of research.

dataset, dclm-baseline, refinedweb, (13 more...)

arXiv.org Artificial Intelligence

2503.07879

Country:

Asia > Middle East > Jordan (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Filters

Collaborating Authors

dclm-baseline

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

DataComp-LM: Insearchofthenextgenerationof trainingsetsforlanguagemodels

PCMind-2.1-Kaiyuan-2B Technical Report

Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality