DataComp-LM: In search of the next generation of training sets for language models

Jeffrey Li* 1,2   Alex Fang* 1,2

Neural Information Processing Systems 

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.