How many labelers do you have? A closer look at gold-standard labels

Chen Cheng, Hilal Asi, John Duchi

arXiv.org Artificial Intelligence 

The centrality of data collection to the development of statistical machine learning is evident [12], with numerous challenge datasets driving advances [27, 25, 1, 22, 11, 37, 38]. Essential to these is the collection of labeled data. While experts could once provide reliable labels for reasonably sized datasets, the cost and scale of modern datasets often preclude expert annotation, motivating a growing literature on crowdsourcing and other sophisticated dataset generation strategies that aggregate expert and non-expert feedback or collect loosely supervised and multimodal data from the internet [10, 20, 48, 37, 34, 38, 13]. By aggregating multiple labels, one typically hopes to obtain clean, true, "gold-standard" data. Yet most statistical machine learning development, whether theoretical or methodological, does not investigate this full data-generating process, assuming only that data come in the form of (X, Y) pairs of covariates X and targets (labels) Y [45, 5, 2, 17]. Here, we argue for a more holistic perspective: broadly, that analysis and algorithmic development should address the complete machine learning pipeline, from dataset construction to model output; and more narrowly, that such aggregation strategies deserve scrutiny, as does the extent to which cleaned data is essential or even useful. To that end, we develop a stylized theoretical model that captures uncertainties in the labeling process, allowing us to understand the contrasts, limitations, and possible improvements of using aggregated or non-aggregated data in a statistical learning pipeline.
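To make the aggregated-versus-non-aggregated contrast concrete, the following is a minimal simulation sketch. It is our illustration, not the paper's construction: we assume a logistic ground truth, labelers who each draw an independent label from P(Y | X), and an off-the-shelf logistic-regression learner; every parameter value below is a hypothetical choice.

```python
# Illustrative sketch (assumptions, not the paper's model): logistic
# ground truth, conditionally independent labelers, logistic learner.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, m = 1000, 5, 7                 # examples, features, labelers per example
theta = rng.normal(size=d)           # true parameter
X = rng.normal(size=(n, d))
p_true = 1.0 / (1.0 + np.exp(-X @ theta))     # P(Y = 1 | X)

# Each labeler reports an independent draw from the same conditional law.
labels = (rng.random((n, m)) < p_true[:, None]).astype(int)

# Strategy 1: aggregate the m labels into one "gold-standard" label.
y_majority = (labels.mean(axis=1) > 0.5).astype(int)

# Strategy 2: no aggregation; train on every (x_i, y_ij) pair.
X_all = np.repeat(X, m, axis=0)      # row i repeated m times, matching ravel order
y_all = labels.ravel()

clf_agg = LogisticRegression().fit(X, y_majority)
clf_raw = LogisticRegression().fit(X_all, y_all)

# Compare estimated conditional probabilities with the truth on fresh data.
X_test = rng.normal(size=(5000, d))
p_test = 1.0 / (1.0 + np.exp(-X_test @ theta))
for name, clf in [("majority-vote", clf_agg), ("non-aggregated", clf_raw)]:
    mse = np.mean((clf.predict_proba(X_test)[:, 1] - p_test) ** 2)
    print(f"{name:>15s}: MSE to true P(Y=1|X) = {mse:.4f}")
```

Under these assumptions, majority voting discards the information that repeated labels carry about P(Y | X) near the decision boundary, so the aggregated fit tends to overstate confidence, while pooling the raw labels preserves calibration. This is the flavor of trade-off a stylized labeling model lets one analyze precisely.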
