Probably Approximately Correct Labels

Candès, Emmanuel J., Ilyas, Andrew, Zrnic, Tijana

arXiv.org Machine Learning 

A key ingredient in machine learning and statistical pipelines alike is the availability of large amounts of high-quality labeled data. Breakthroughs in computer vision stem from the collection of millions of labeled images [8]; social science research relies on extensively labeled datasets to understand human behavior and opinions [22]. While acquiring unlabeled data (e.g., raw images or texts from the internet) can be relatively inexpensive, acquiring high-quality labels is typically an endeavor that requires significant time and effort from human experts. Given the expense of collecting high-quality labels, an enticing prospect is to use increasingly powerful AI models to predict labels for datasets, bypassing the need for human experts entirely. Indeed, recent works have demonstrated AI models' ability to predict protein structures [17], to evaluate language model responses [39], and even to simulate human experimental subjects [23]. These advances highlight the potential for AI to streamline data annotation, and to produce high-quality labels at a fraction of the cost of human experts. The problem with such an approach is that AI models are not always correct, and in particular come with no guarantees on how well they will label a given dataset. This makes it untenable to use AI-predicted labels as a direct substitute for human labels, particularly in settings where label quality is critical--for instance, in high-stakes applications like medical diagnosis, or when the downstream task is to draw conclusions that inform policy decisions. Motivated by this state of affairs, in this paper we ask: Can we leverage powerful AI models to label data, while still guaranteeing quality?
