Probably Approximately Correct Labels

Emmanuel J. Candès, Andrew Ilyas, Tijana Zrnic

Jun-13-2025, arXiv.org Machine Learning

A key ingredient in machine learning and statistical pipelines alike is the availability of large amounts of high-quality labeled data. Breakthroughs in computer vision stem from the collection of millions of labeled images [8]; social science research relies on extensively labeled datasets to understand human behavior and opinions [22]. While acquiring unlabeled data (e.g., raw images or texts from the internet) can be relatively inexpensive, acquiring high-quality labels is typically an endeavor that requires significant time and effort from human experts. Given the expense of collecting high-quality labels, an enticing prospect is to use increasingly powerful AI models to predict labels for datasets, bypassing the need for human experts entirely. Indeed, recent works have demonstrated AI models' ability to predict protein structures [17], to evaluate language model responses [39], and even to simulate human experimental subjects [23]. These advances highlight the potential for AI to streamline data annotation, and to produce high-quality labels at a fraction of the cost of human experts. The problem with such an approach is that AI models are not always correct, and in particular come with no guarantees on how well they will label a given dataset. This makes it untenable to use AI-predicted labels as a direct substitute for human labels, particularly in settings where label quality is critical: for instance, in high-stakes applications like medical diagnosis, or when the downstream task is to draw conclusions that inform policy decisions. Motivated by this state of affairs, in this paper we ask: Can we leverage powerful AI models to label data, while still guaranteeing quality?
