DataComp-LM: In search of the next generation of training sets for language models
Jeffrey Li*, Alex Fang*
Neural Information Processing Systems
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.