Essential-Web v1.0: 24T tokens of organized web data

Essential AI: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, Ashish Vaswani

Jun-23-2025–arXiv.org Artificial Intelligence

Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%), and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
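To make the idea of curating with "SQL-style filters" concrete, here is a minimal sketch using Python's built-in sqlite3 module on a handful of toy records. The table name, the column names (`topic`, `doc_format`, `quality`), the label values, and the helper functions are all hypothetical and chosen only for illustration; they are not the dataset's actual schema, and the abstract does not specify the filters used for the reported subsets.

```python
import sqlite3

# Toy records standing in for annotated documents.
# ASSUMPTION: these columns and label values are hypothetical,
# not the real Essential-Web v1.0 taxonomy fields.
ROWS = [
    ("doc-1", "mathematics", "tutorial", 4),
    ("doc-2", "mathematics", "forum_post", 1),
    ("doc-3", "medicine", "reference", 5),
    ("doc-4", "software", "tutorial", 3),
]


def build_index(rows):
    """Load the annotated records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs ("
        "doc_id TEXT PRIMARY KEY, topic TEXT, doc_format TEXT, quality INTEGER)"
    )
    conn.executemany("INSERT INTO docs VALUES (?, ?, ?, ?)", rows)
    return conn


def select_subset(conn, topic, min_quality):
    """Return ids of documents with the given topic label and a minimum quality score."""
    cursor = conn.execute(
        "SELECT doc_id FROM docs WHERE topic = ? AND quality >= ? ORDER BY doc_id",
        (topic, min_quality),
    )
    return [row[0] for row in cursor.fetchall()]


conn = build_index(ROWS)
math_ids = select_subset(conn, "mathematics", 3)
print(math_ids)
```

The point of the sketch is only that, once every document carries taxonomy labels, selecting a domain subset reduces to a declarative predicate over those labels rather than a bespoke classification pipeline.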

large language model, machine learning, natural language, (21 more...)

Country:
- North America > United States (0.67)
- Asia (0.46)

Genre:
- Research Report > New Finding (0.92)
- Instructional Material > Course Syllabus & Notes (0.67)

Industry:
- Information Technology (0.67)
- Education > Curriculum
  - Subject-Specific Education (0.45)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (0.92)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.69)
  - Machine Learning
    - Neural Networks > Deep Learning (1.00)
    - Statistical Learning (0.67)
