QuRating: Selecting High-Quality Data for Training Language Models
Wettig, Alexander, Gupta, Aatmik, Malik, Saumya, Chen, Danqi
arXiv.org Artificial Intelligence
Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that captures the abstract qualities of texts which humans intuitively perceive. In this paper, we investigate four qualities: writing style, required expertise, facts & trivia, and educational value. We find that LLMs are able to discern these qualities and observe that they are better at making pairwise judgments of texts than at rating the quality of a text directly. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B-token training corpus with quality ratings for each of the four criteria. In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity, as selecting only the highest-rated documents leads to poor results. When we sample using quality ratings as logits over documents, our models achieve lower perplexity and stronger in-context learning performance than baselines. Beyond data selection, we use the quality ratings to construct a training curriculum which improves performance without changing the training dataset. We extensively analyze the quality ratings and discuss their characteristics, biases, and wider implications.
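The two mechanisms the abstract names, scalar ratings learned from pairwise judgments and sampling with ratings as logits, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the Bradley-Terry form of the preference probability, the `temperature` parameter, and the Gumbel top-k trick for sampling without replacement are all assumptions made for illustration; the abstract states only that ratings are learned from pairwise judgments and used as logits over documents.

```python
import math
import random


def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that text B is preferred over text A.

    Assumption: a Bradley-Terry style model, where the preference
    depends only on the difference of the two scalar ratings.
    """
    return 1.0 / (1.0 + math.exp(-(score_b - score_a)))


def sample_by_quality(ratings, k, temperature=1.0, seed=None):
    """Pick k distinct document indices, treating ratings as logits.

    Adding independent Gumbel noise to each logit and keeping the k
    largest values is equivalent to drawing documents one at a time,
    without replacement, with probability proportional to
    exp(rating / temperature) among the documents not yet drawn.
    """
    if not 0 <= k <= len(ratings):
        raise ValueError("k must be between 0 and len(ratings)")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    rng = random.Random(seed)
    noisy = []
    for index, rating in enumerate(ratings):
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() can return exactly 0.0
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        noisy.append((rating / temperature + gumbel, index))

    noisy.sort(reverse=True)
    return [index for _, index in noisy[:k]]


if __name__ == "__main__":
    ratings = [0.1, 5.0, -3.0, 2.0]  # hypothetical quality ratings
    print(preference_probability(ratings[0], ratings[1]))
    print(sample_by_quality(ratings, k=2, temperature=2.0, seed=0))
```

In this sketch a very small `temperature` approaches plain top-k selection, which the abstract reports leads to poor results, while a larger value moves toward uniform sampling and so retains more diversity.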
Feb-15-2024
- Country:
- Oceania > Australia (0.45)
- Africa > Middle East (0.45)
- South America > Venezuela (0.14)
- North America
- Mexico (0.27)
- Cuba (0.13)
- United States
- Texas (0.67)
- Maryland (0.27)
- Washington (0.27)
- Colorado (0.14)
- Massachusetts (0.14)
- Illinois > Cook County (0.13)
- District of Columbia > Washington (0.13)
- Oregon (0.13)
- Florida (0.13)
- Indiana (0.13)
- Louisiana (0.13)
- Minnesota > Hennepin County
- Minneapolis (0.13)
- California > San Francisco County
- San Francisco (0.13)
- Arizona > Pima County
- Tucson (0.13)
- New York > New York County
- New York City (0.27)
- Canada
- Europe
- Asia
- Russia (0.45)
- India (0.28)
- Japan (0.27)
- Afghanistan (0.27)
- China (0.14)
- Philippines (0.14)
- Pakistan (0.13)
- Middle East
- Republic of Türkiye (0.67)
- Israel (0.45)
- Syria (0.27)
- Iraq (0.27)
- UAE (0.14)
- Palestine > Gaza Strip
- Gaza Governorate > Gaza (0.13)
- Genre:
- Personal (1.00)
- Instructional Material (1.00)
- Research Report
- New Finding (1.00)
- Experimental Study (1.00)
- Industry:
- Food & Agriculture (1.00)
- Banking & Finance > Trading (1.00)
- Retail (1.00)
- Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (0.92)
- Law Enforcement & Public Safety
- Terrorism (1.00)
- Crime Prevention & Enforcement (1.00)
- Leisure & Entertainment
- Information Technology
- Services (1.00)
- Security & Privacy (1.00)
- Energy
- Power Industry (0.92)
- Oil & Gas > Downstream (0.92)
- Law
- Statutes (1.00)
- Litigation (1.00)
- Government & the Courts (1.00)
- Criminal Law (1.00)
- Health & Medicine
- Pharmaceuticals & Biotechnology (1.00)
- Health Care Providers & Services (1.00)
- Epidemiology (1.00)
- Consumer Health (1.00)
- Therapeutic Area
- Psychiatry/Psychology (1.00)
- Oncology (1.00)
- Neurology (1.00)
- Infections and Infectious Diseases (1.00)
- Immunology (1.00)
- Government
- Voting & Elections (1.00)
- Military (1.00)
- Regional Government
- North America Government > United States Government (1.00)
- Asia Government > Middle East Government (1.00)
- Europe Government (0.92)
- Materials
- Metals & Mining (1.00)
- Chemicals > Commodity Chemicals
- Petrochemicals > Polymers & Plastics (0.67)
- Education
- Health & Safety > School Nutrition (1.00)
- Educational Setting (1.00)
- Transportation
- Media
- Television (1.00)
- News (1.00)
- Music (1.00)
- Film (1.00)
- Technology: