DataComp-LM: In search of the next generation of training sets for language models
Jeffrey Li*1,2, Alex Fang
Neural Information Processing Systems
As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set.
Nov-14-2025, 05:13:30 GMT
- Country:
  - Asia
    - China > Hong Kong (0.04)
    - Indonesia > Bali (0.04)
    - Japan
      - Honshū > Chūbu
        - Toyama Prefecture > Toyama (0.04)
      - Kyūshū & Okinawa > Kyūshū
        - Miyazaki Prefecture > Miyazaki (0.04)
    - Middle East
      - Israel > Tel Aviv District
        - Tel Aviv (0.04)
      - Jordan (0.04)
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.04)
    - Myanmar > Tanintharyi Region
      - Dawei (0.04)
    - Singapore (0.04)
  - Europe
    - Austria (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Germany
      - Bavaria > Upper Bavaria
        - Munich (0.04)
      - Berlin (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Italy > Tuscany
      - Florence (0.04)
    - Spain (0.04)
  - North America
    - Canada > British Columbia
      - Vancouver (0.04)
    - Dominican Republic (0.04)
    - United States
      - California
        - Kern County > Bakersfield (0.04)
        - Los Angeles County
          - Long Beach (0.04)
          - Los Angeles (0.14)
        - San Diego County > San Diego (0.04)
        - Santa Barbara County > Santa Barbara (0.04)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - Maryland > Baltimore (0.04)
      - Michigan > Washtenaw County
        - Ann Arbor (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - North Carolina > Mecklenburg County
        - Charlotte (0.04)
      - Ohio
        - Clark County > Springfield (0.04)
        - Cuyahoga County > Cleveland (0.04)
      - Texas > Travis County
        - Austin (0.27)
  - South America > Falkland Islands (0.04)
- Genre:
  - Research Report
    - Experimental Study (0.67)
    - New Finding (0.92)
- Industry:
  - Education (1.00)
  - Government (1.00)
  - Information Technology > Security & Privacy (1.00)
  - Law (1.00)
  - Leisure & Entertainment > Games (0.92)
- Technology:
  - Information Technology
    - Artificial Intelligence
      - Machine Learning
        - Neural Networks > Deep Learning (1.00)
        - Performance Analysis > Accuracy (1.00)
        - Statistical Learning (0.67)
      - Natural Language
        - Chatbot (1.00)
        - Large Language Model (1.00)
      - Representation & Reasoning > Commonsense Reasoning (0.93)
    - Communications > Social Media (1.00)
    - Data Science
      - Data Mining (1.00)
      - Data Quality (1.00)
    - Software > Programming Languages (1.00)