DataComp-LM: In search of the next generation of training sets for language models
–Neural Information Processing Systems
As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set.
Feb-8-2026, 17:45:09 GMT
- Country:
- Asia
- China > Hong Kong (0.04)
- Indonesia > Bali (0.04)
- Japan
- Honshū > Chūbu
- Toyama Prefecture > Toyama (0.04)
- Kyūshū & Okinawa > Kyūshū
- Miyazaki Prefecture > Miyazaki (0.04)
- Middle East
- Israel > Tel Aviv District
- Tel Aviv (0.04)
- Jordan (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Myanmar > Tanintharyi Region
- Dawei (0.04)
- Singapore (0.04)
- Europe
- Austria (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Germany > Berlin (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Italy > Tuscany
- Florence (0.04)
- Spain > Valencian Community
- Valencia Province > Valencia (0.04)
- North America
- Canada > British Columbia
- United States
- New York > New York County
- New York City (0.04)
- California
- Kern County > Bakersfield (0.04)
- Los Angeles County > Los Angeles (0.14)
- Santa Barbara County > Santa Barbara (0.04)
- North Carolina > Mecklenburg County
- Charlotte (0.04)
- Ohio
- Clark County > Springfield (0.04)
- Cuyahoga County > Cleveland (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Maryland > Baltimore (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Michigan > Washtenaw County
- Ann Arbor (0.04)
- Texas > Travis County
- Austin (0.14)
- South America > Falkland Islands (0.04)
- Genre:
- Research Report (1.00)
- Industry:
- Education (1.00)
- Government (1.00)
- Information Technology
- Security & Privacy (0.45)
- Software (0.45)
- Law (0.92)
- Leisure & Entertainment > Games (0.46)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (1.00)
- Natural Language
- Chatbot (0.93)
- Large Language Model (1.00)
- Representation & Reasoning (1.00)
- Communications > Social Media (0.93)
- Data Science (1.00)
- Information Management (1.00)
- Software (0.92)