AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters
Li Lucy, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren Klein, Jesse Dodge
arXiv.org Artificial Intelligence
Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, in its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten "quality" and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and their social implications.
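As a rough illustration of the kind of comparison the abstract describes, the sketch below computes the fraction of pages a filter retains within each group. It is not the authors' pipeline: the function `retention_by_group`, the `region_a`/`region_b` labels, the toy pages, and the word-count filter are all hypothetical stand-ins for a real quality or langID filter and real self-description data.

```python
from collections import defaultdict

def retention_by_group(pages, keep):
    """Return the fraction of pages in each group that the filter keeps."""
    kept = defaultdict(int)
    total = defaultdict(int)
    for page in pages:
        total[page["group"]] += 1
        if keep(page["text"]):
            kept[page["group"]] += 1
    return {group: kept[group] / total[group] for group in total}

# Toy data (assumption): pages tagged with a hypothetical region label.
pages = [
    {"group": "region_a", "text": "We are a family bakery serving fresh bread daily."},
    {"group": "region_a", "text": "Hi!"},
    {"group": "region_b", "text": "I write about software, hiking, and photography."},
    {"group": "region_b", "text": "Our firm offers legal advice to small businesses."},
]

# Stand-in filter (assumption): keep texts with at least five words.
def min_length_filter(text, min_words=5):
    return len(text.split()) >= min_words

rates = retention_by_group(pages, min_length_filter)
print(rates)
```

Comparing these per-group rates is one simple way to see whether a filter that looks neutral removes content unevenly across topics, roles, or regions.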
Jan-16-2024
- Country:
- Africa
- Middle East > Egypt (0.04)
- Nigeria (0.04)
- North Africa (0.04)
- South Africa (0.04)
- Sub-Saharan Africa (0.04)
- Antarctica (0.04)
- Asia
- Indonesia > Bali (0.04)
- Malaysia (0.04)
- Central Asia (0.04)
- Japan (0.04)
- Middle East
- Israel (0.04)
- Jordan (0.04)
- Republic of Türkiye (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- China (0.04)
- Taiwan (0.04)
- Singapore (0.04)
- India (0.04)
- Europe
- Northern Europe (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Kosovo (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Eastern Europe (0.04)
- Western Europe (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Italy (0.04)
- France
- Provence-Alpes-Côte d'Azur > Bouches-du-Rhône
- Marseille (0.04)
- Île-de-France > Paris
- Paris (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Oxfordshire > Oxford (0.04)
- Netherlands (0.04)
- Germany (0.04)
- North America
- Canada > Ontario
- Toronto (0.04)
- Central America (0.04)
- Dominican Republic (0.04)
- United States
- California > Alameda County
- Berkeley (0.04)
- Georgia > Fulton County
- Atlanta (0.04)
- Idaho (0.04)
- New York > New York County
- New York City (0.04)
- Ohio > Butler County
- Oxford (0.14)
- Virginia > Alexandria County
- Alexandria (0.04)
- Oceania
- Australia (0.04)
- Micronesia (0.04)
- New Zealand (0.04)
- Pacific Ocean > North Pacific Ocean
- San Francisco Bay > Golden Gate (0.04)
- South America (0.04)
- Genre:
- Research Report
- Experimental Study (0.94)
- New Finding (1.00)
- Industry:
- Banking & Finance > Real Estate (0.68)
- Education > Educational Setting (0.46)
- Government (1.00)
- Health & Medicine (1.00)
- Leisure & Entertainment (1.00)
- Media > News (0.68)
- Technology: