Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

Northcutt, Curtis G., Athalye, Anish, Mueller, Jonas

Apr-8-2021, arXiv.org Artificial Intelligence

We algorithmically identify label errors in the test sets of 10 of the most commonly used computer vision, natural language, and audio datasets, and then study how these errors can affect benchmark results. Errors in test sets are numerous and widespread: we estimate an average error rate of 3.4% across the 10 datasets; for example, 2916 label errors make up 6% of the ImageNet validation set. Putative label errors are found using confident learning and then validated by humans via crowdsourcing (54% of the algorithmically flagged candidates are indeed erroneously labeled). Surprisingly, we find that lower-capacity models may be practically more useful than higher-capacity models on real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels, ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels, VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by 5%. Traditionally, ML practitioners choose which model to deploy based on test accuracy. Our findings advise caution here and suggest that judging models on correctly labeled test sets may be more useful, especially for noisy real-world datasets.
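The abstract names confident learning as the way candidate label errors are flagged before human review. The following is a minimal sketch of that general idea, not the authors' implementation: the function name `flag_label_issues`, the toy data, and the simplification to a single thresholding pass are all assumptions made for illustration. It assumes every class has at least one example carrying that label and that `pred_probs` holds out-of-sample predicted probabilities.

```python
def flag_label_issues(labels, pred_probs):
    """Return indices of examples whose given label looks suspect.

    Hypothetical helper for illustration only (simplified confident learning).
    labels: list of int class ids; pred_probs: list of per-class probabilities.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the mean predicted probability of class k
    # over the examples that carry label k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))

    flagged = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # no confident class, so no evidence against the label
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            flagged.append(i)  # confident class disagrees with the given label
    return flagged


# Toy data (made up): example 2 is labeled 0 but predicted strongly as 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
              [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
print(flag_label_issues(labels, pred_probs))  # [2]
```

Flagged indices are only candidates. As the abstract notes, about half of the algorithmically flagged examples were confirmed as errors by human reviewers, so a validation step would still be needed.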

accuracy, dataset, label error

Country:
- North America
  - United States
    - California (0.04)
    - Oregon > Multnomah County
      - Portland (0.04)
    - New York > New York County
      - New York City (0.04)
    - Massachusetts > Middlesex County
      - Cambridge (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
  - Canada > Ontario
    - Toronto (0.14)
- Europe > Spain
  - Canary Islands (0.04)

Genre:
- Research Report > New Finding (0.66)

Industry:
- Education (0.46)

Technology:
- Information Technology
  - Communications > Social Media
    - Crowdsourcing (0.67)
  - Artificial Intelligence > Machine Learning
    - Neural Networks > Deep Learning (0.33)
