The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Neural Information Processing Systems
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb.
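As a concrete starting point, the sketch below shows one way to stream a few FineWeb documents for inspection. It assumes the corpus is published on the Hugging Face Hub under the repository names "HuggingFaceFW/fineweb" and "HuggingFaceFW/fineweb-edu" and that records expose `text` and `url` fields; these names are assumptions for illustration, not details stated in the abstract.

```python
# Minimal sketch: stream a handful of FineWeb records without downloading
# the full multi-terabyte corpus. Repository and field names are assumptions.
from datasets import load_dataset

# streaming=True iterates over shards lazily instead of materializing them.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, doc in enumerate(fineweb):
    # Each record is assumed to carry the extracted page text plus crawl
    # metadata such as the source URL.
    print(doc.get("url", "<no url>"), len(doc.get("text", "")))
    if i >= 4:
        break

# The educational subset would be loaded the same way, swapping in
# "HuggingFaceFW/fineweb-edu" as the repository name.
```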