The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Neural Information Processing Systems 

The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb.
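To make the deduplication investigations mentioned above concrete, the sketch below shows a minimal, self-contained example of MinHash-based near-duplicate detection, the general family of fuzzy deduplication applied to web-scale corpora. The shingle size, signature length, similarity threshold, and all function names here are illustrative assumptions, not FineWeb's actual configuration.

```python
# Illustrative sketch only: a toy MinHash near-duplicate detector.
# All parameters below are arbitrary example values, not FineWeb's settings.
import hashlib
from itertools import combinations

NUM_HASHES = 64      # length of the MinHash signature (assumed)
SHINGLE_SIZE = 5     # word n-gram size used as the document unit (assumed)
THRESHOLD = 0.8      # estimated Jaccard similarity above which docs are flagged (assumed)

def shingles(text: str, n: int = SHINGLE_SIZE) -> set[str]:
    """Split a document into overlapping word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(doc_shingles: set[str]) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in doc_shingles
        )
        for seed in range(NUM_HASHES)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

if __name__ == "__main__":
    docs = {
        "a": "the cat sat on the mat and looked out of the window at the rain",
        "b": "the cat sat on the mat and looked out of the window at the snow",
        "c": "an entirely different document about language model pretraining data",
    }
    sigs = {name: minhash_signature(shingles(text)) for name, text in docs.items()}
    for x, y in combinations(sigs, 2):
        sim = estimated_jaccard(sigs[x], sigs[y])
        if sim >= THRESHOLD:
            print(f"{x} and {y} look like near-duplicates (estimated Jaccard ~{sim:.2f})")
```

In a real pipeline, pairwise comparison would be replaced by locality-sensitive hashing so that only candidate pairs sharing signature bands are compared; this sketch keeps the pairwise loop purely for readability.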
