A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito

arXiv.org Artificial Intelligence 

The strong performance (Chowdhery et al., 2022; Nostalgebraist, 2022; OpenAI, 2023; Google, 2023) and emergent abilities (Wei et al., 2022) of modern language models (LMs) depend on self-supervised pretraining on massive text datasets. All model developers implicitly or explicitly decide the composition of these datasets: what data sources to include, whether to filter for attributes such as quality and toxicity, and when to gather new documents. Many of the most prominent models do not document their curation procedures at all (OpenAI, 2023; Google, 2023). Others document which procedures they used (Brown et al., 2020; Nostalgebraist, 2022; Scao et al., 2022; Touvron et al., 2023), but rarely why they chose those protocols or what effect they had. This documentation debt leaves practitioners to be guided by intuitions and precedents, neither of which has been thoroughly evaluated (Bandy and Vincent, 2021; Sambasivan et al., 2021). Given the outsized and fundamental role of pretraining data in modern LMs, we believe this neglectful practice has detracted from responsible data use and hampered effective model development (Rogers, 2021; Gebru et al., 2021; Bender and Friedman, 2018). Among the small number of general-purpose LMs dominating community use and discussion, the prevailing focus has been on the scale of pretraining data and the number of optimization steps (Brown et al., 2020; Nostalgebraist, 2022; Google, 2023). In this work, we systematically test how common data design decisions affect model performance, specifically: the time of collection, the content filtering strategy (toxicity/quality), and the domain composition. We study the impacts in two ways.
