Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data
Esther Rolf, Theodora Worledge, Benjamin Recht, Michael I. Jordan
Datasets play a critical role in shaping the perception of performance and progress in machine learning (ML)--the way we collect, process, and analyze data affects the way we benchmark success and form new research agendas (Paullada et al., 2020; Dotan & Milli, 2020). A growing appreciation of this determinative role of datasets has sparked a concomitant concern that standard datasets used for training and evaluating ML models lack diversity along significant dimensions, for example, geography, gender, and skin type (Shankar et al., 2017; Buolamwini & Gebru, 2018). Lack of diversity in evaluation data can obfuscate disparate performance when evaluating based on aggregate accuracy (Buolamwini & Gebru, 2018). Lack of diversity in training data can limit the extent to which learned models can adequately apply to all portions of a population, a concern highlighted in recent work in the medical domain (Habib et al., 2019; Hofmanninger et al., 2020). Our work aims to develop a general unifying perspective on the way that dataset composition affects outcomes of machine learning systems.
March 4, 2021