AITopics | refinedweb

Country:

North America > United States > Massachusetts (0.04)
North America > United States > Florida > Martin County > Stuart (0.04)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Information Technology (1.00)
Banking & Finance > Real Estate (0.93)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
(2 more...)

Neural Information Processing Systems | Feb-18-2026, 02:00:35 GMT

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only

The Falcon LLM Team

large language model, machine learning, natural language, (18 more...)

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Texas > Dallas County > Dallas (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(2 more...)

Genre:

Research Report (0.68)
Overview (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Neural Information Processing Systems | Oct-9-2025, 12:25:42 GMT

A NeurIPS Datasets and Benchmark Checklist

If it is a counter (e.g.

large language model, machine learning, natural language, (24 more...)

Country:

North America > United States > Massachusetts (0.04)
North America > United States > Florida > Martin County > Stuart (0.04)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Information Technology (1.00)
Banking & Finance > Real Estate (0.93)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
(2 more...)

Neural Information Processing Systems | Oct-9-2025, 12:25:39 GMT

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only

The Falcon LLM Team

large language model, machine learning, natural language, (17 more...)

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Texas > Dallas County > Dallas (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(2 more...)

Genre:

Research Report (0.68)
Overview (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

arXiv.org Artificial Intelligence | Mar-10-2025

Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

Fang, Alex, Pouransari, Hadi, Jordan, Matt, Toshev, Alexander, Shankar, Vaishaal, Schmidt, Ludwig, Gunter, Tom

Data filtering has become a powerful tool for improving model performance while reducing computational cost. However, as large language model compute budgets continue to grow, the limited data volume provided by heavily filtered and deduplicated datasets will become a practical constraint. In an effort to better understand how to proceed, we study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch, across compute budgets spanning multiple orders of magnitude. While this finding relies on repeating the dataset for many epochs, we also investigate repeats within these datasets at the document level. We find that not all documents within a dataset are equal, and we can create better datasets relative to a token budget by explicitly manipulating the counts of individual documents. We conclude by arguing that even as large language models scale, data filtering remains an important direction of research.

dataset, dclm-baseline, refinedweb, (13 more...)

2503.07879

Country:

Asia > Middle East > Jordan (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Mansour, Youssef, Heckel, Reinhard

Measuring Bias of Web-filtered Text Datasets and Bias Propagation Through Training

arXiv.org Artificial Intelligence | Dec-3-2024

We investigate biases in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of biases in popular computer vision datasets, we analyze popular open-source pretraining datasets for LLMs derived from CommonCrawl including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and DCLM-Baseline. Despite those datasets being obtained with similar filtering and deduplication steps, neural networks can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that popular pretraining datasets have their own unique biases or fingerprints. Those biases remain even when the text is rewritten with LLMs. Moreover, these biases propagate through training: Random sequences generated by models trained on those datasets can be classified well by a classifier trained on the original datasets.

large language model, machine learning, natural language, (19 more...)

2412.02857

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > Mexico (0.04)
(6 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Materials > Chemicals (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Law Enforcement & Public Safety (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Herold, Christian, Kozielski, Michael, Ekimov, Leonid, Petrushkov, Pavel, Vandenbussche, Pierre-Yves, Khadivi, Shahram

LiLiuM: eBay's Large Language Models for e-commerce

arXiv.org Artificial Intelligence | Jun-17-2024

We introduce the LiLiuM series of large language models (LLMs): 1B, 7B, and 13B parameter models developed 100% in-house to fit eBay's specific needs in the e-commerce domain. This gives eBay full control over all aspects of the models including license, data, vocabulary, and architecture. We expect these models to be used as a foundation for fine-tuning and instruction-tuning, eliminating dependencies on external models. The LiLiuM LLMs have been trained on 3 trillion tokens of multilingual text from the general and e-commerce domains. They perform similarly to the popular LLaMA-2 models on English natural language understanding (NLU) benchmarks. At the same time, we outperform LLaMA-2 on non-English NLU tasks, machine translation, and e-commerce-specific downstream tasks. As part of our data mixture, we utilize the newly released RedPajama-V2 dataset for training and share our insights regarding data filtering and deduplication. We also discuss in detail how to serialize structured data for use in autoregressive language modeling. We provide insights on the effects of including code and parallel machine translation data in pre-training. Furthermore, we develop our own tokenizer and model vocabulary, customized towards e-commerce. This way, we can achieve up to 34% speed-up in text generation on eBay-specific downstream tasks compared to LLaMA-2. Finally, in relation to LLM pretraining, we show that checkpoint averaging can further improve over the best individual model checkpoint.

checkpoint, computational linguistic, refinedweb, (15 more...)

2406.12023

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(15 more...)

Genre: Research Report (0.50)

Industry: Information Technology > Services > e-Commerce Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Tokpanov, Yury, Millidge, Beren, Glorioso, Paolo, Pilault, Jonathan, Ibrahim, Adam, Whittington, James, Anthony, Quentin

Zyda: A 1.3T Dataset for Open Language Modeling

arXiv.org Artificial Intelligence | Jun-4-2024

The size of large language models (LLMs) has scaled dramatically in recent years and their computational and data requirements have surged correspondingly. State-of-the-art language models, even at relatively smaller sizes, typically require training on at least a trillion tokens. This rapid advancement has eclipsed the growth of open-source datasets available for large-scale LLM pretraining. In this paper, we introduce Zyda (Zyphra Dataset), a dataset under a permissive license comprising 1.3 trillion tokens, assembled by integrating several major respected open-source datasets into a single, high-quality corpus. We apply rigorous filtering and deduplication processes, both within and across datasets, to maintain and enhance the quality derived from the original datasets. Our evaluations show that Zyda not only competes favorably with other open datasets like Dolma, FineWeb, and RefinedWeb, but also substantially improves the performance of comparable models from the Pythia suite. Our rigorous data processing methods significantly enhance Zyda's effectiveness, outperforming even the best of its constituent datasets when used independently.

large language model, machine learning, natural language, (20 more...)

2406.01981

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)

arXiv.org Artificial Intelligence | Mar-17-2024

WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset

Qiu, Jiantao, Lv, Haijun, Jin, Zhenjiang, Wang, Rui, Ning, Wenchang, Yu, Jia, Zhang, ChaoBin, Li, Zhenxiang, Chu, Pei, Qu, Yuan, Shi, Jin, Lu, Lindong, Peng, Runyu, Zeng, Zhiyuan, Tang, Huanze, Lei, Zhikai, Hong, Jiawei, Chen, Keyu, Fei, Zhaoye, Xu, Ruiliang, Li, Wei, Tu, Zhongying, Dahua, Lin, Qiao, Yu, Yan, Hang, He, Conghui

This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for language models, which require vast amounts of high-quality data. A comprehensive process was designed to handle Common Crawl data, including extraction, heuristic rule filtering, fuzzy deduplication, content safety filtering, and data quality filtering. From approximately 68 billion original English documents, we obtained 2.22T tokens of safe data and selected 1.0T tokens of high-quality data as part of WanJuan-CC. We have open-sourced 100B tokens from this dataset. The paper also provides statistical information related to data quality, enabling users to select appropriate data according to their needs. To evaluate the quality and utility of the dataset, we trained 1B-parameter and 3B-parameter models using WanJuan-CC and another dataset, RefinedWeb. Results show that WanJuan-CC performs better on validation datasets and downstream tasks.

data quality, dataset, language model, (15 more...)

2402.19282

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
(2 more...)