Essential-Web v1.0: 24T tokens of organized web data
Essential AI: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, Ashish Vaswani
arXiv.org Artificial Intelligence
Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
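To make the idea of curating by "SQL-style filters" concrete, here is a minimal sketch using Python's built-in `sqlite3` module on a few toy rows. The table name `docs`, the column names (`topic`, `format`, `quality`), and the label values are all hypothetical and chosen for illustration only; they are not the dataset's actual schema, and this is not the authors' pipeline.

```python
import sqlite3

# Toy rows standing in for annotated documents. Column names and label
# values are hypothetical, not the dataset's actual schema.
ROWS = [
    ("doc-1", "mathematics", "tutorial", 4),
    ("doc-2", "mathematics", "forum_post", 1),
    ("doc-3", "medicine", "reference", 5),
    ("doc-4", "software", "tutorial", 3),
]


def select_docs(topic, min_quality):
    """Return ids of documents with the given topic and quality >= min_quality."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs (id TEXT, topic TEXT, format TEXT, quality INTEGER)"
    )
    conn.executemany("INSERT INTO docs VALUES (?, ?, ?, ?)", ROWS)
    # The curation step itself is a single WHERE clause over taxonomy labels.
    cur = conn.execute(
        "SELECT id FROM docs WHERE topic = ? AND quality >= ? ORDER BY id",
        (topic, min_quality),
    )
    ids = [row[0] for row in cur.fetchall()]
    conn.close()
    return ids


print(select_docs("mathematics", 3))
```

The point of the sketch is only that, once every document carries taxonomy labels, selecting a domain subset reduces to a declarative predicate rather than a bespoke classifier pipeline.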
Jun-23-2025
- Country:
  - North America > United States (0.67)
  - Asia (0.46)
- Genre:
  - Research Report > New Finding (0.92)
  - Instructional Material > Course Syllabus & Notes (0.67)
- Industry:
  - Information Technology (0.67)
  - Education > Curriculum > Subject-Specific Education (0.45)
- Technology: