Class-wise Autoencoders Measure Classification Difficulty And Detect Label Mistakes

Jacob Marks, Brent A. Griffin, Jason J. Corso

Dec-3-2024–arXiv.org Artificial Intelligence

We introduce a new framework for analyzing classification datasets based on the ratios of reconstruction errors between autoencoders trained on individual classes. This analysis framework enables efficient characterization of datasets on the sample, class, and entire dataset levels. We define reconstruction error ratios (RERs) that probe classification difficulty and allow its decomposition into (1) finite sample size and (2) Bayes error and decision-boundary complexity. Through systematic study across 19 popular visual datasets, we find that our RER-based dataset difficulty probe strongly correlates with error rate for state-of-the-art (SOTA) classification models. By interpreting sample-level classification difficulty as a label mistakenness score, we further find that RERs achieve SOTA performance on mislabel detection tasks on hard datasets under symmetric and asymmetric label noise.

Data is the cornerstone of modern machine learning. As the data-centric AI movement has made increasingly clear, both predictive and generative ML models rely on sufficiently large and diverse high-quality datasets (Deng et al., 2009b; Radford et al., 2018; Kaplan et al., 2020). However, it is well known that even popular visual datasets like CIFAR-100 (Krizhevsky & Hinton, 2009), Caltech-256 (Griffin et al., 2007), and ImageNet (Deng et al., 2009b) can have hundreds or thousands of data quality issues, including up to 10% label errors (Northcutt et al., 2021). Consequently, curating a high-quality dataset requires not only data collection but also data cleaning, characterization, evaluation, and refinement.

Nevertheless, existing methods for data quality assessment are inherently limited. Methods that seek to estimate the classification difficulty of a sample or dataset are either model-dependent (Ethayarajh et al., 2021), computationally infeasible (Scheidegger et al., 2021), or break down when applied to challenging datasets (Zhang et al., 2020). Likewise, mislabel detection methods either rely on training a strong classifier on the dataset (Pruthi et al., 2020; Pleiss et al., 2020), which becomes more time- and compute-intensive for more complex datasets, or exhibit degraded performance on datasets with complex decision boundaries (Zhu et al., 2021; Northcutt et al., 2021). To address these limitations, we propose a novel approach for characterizing the difficulty of classification datasets by decomposing complex multi-class classification problems into one manifold learning problem for each class.
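The idea of comparing per-class reconstruction errors can be illustrated with a minimal sketch. This excerpt does not give the autoencoder architecture or the exact RER formula, so everything below is an assumption made for illustration: a rank-k linear autoencoder (PCA) stands in for each class-wise autoencoder, and the ratio is taken as the error under the labeled class's model divided by the smallest error under any other class's model. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Fit a rank-k linear autoencoder (PCA) to the rows of X.
    Stand-in for a class-wise autoencoder; not the paper's architecture."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruction_error(X, model):
    """Squared reconstruction error of each row of X under one model."""
    mean, comps = model
    centered = X - mean
    recon = centered @ comps.T @ comps
    return np.sum((centered - recon) ** 2, axis=1)

def reconstruction_error_ratios(X, y, k=2, eps=1e-12):
    """Assumed RER form: error under the labeled class's model divided by
    the smallest error under any other class's model."""
    classes = np.unique(y)
    models = [fit_linear_autoencoder(X[y == c], k) for c in classes]
    errors = np.stack([reconstruction_error(X, m) for m in models], axis=1)
    rows = np.arange(len(y))
    cols = np.searchsorted(classes, y)  # column of each sample's own class
    own = errors[rows, cols]
    others = errors.copy()
    others[rows, cols] = np.inf         # exclude the labeled class
    return own / (others.min(axis=1) + eps)

def demo(seed=0):
    """Synthetic two-class data on different 2-D subspaces of a 10-D space,
    with five class-0 samples deliberately mislabeled as class 1."""
    rng = np.random.default_rng(seed)
    n, d = 100, 10
    X0 = np.zeros((n, d))
    X0[:, 0:2] = rng.normal(size=(n, 2))
    X1 = np.zeros((n, d))
    X1[:, 2:4] = rng.normal(size=(n, 2))
    X = np.vstack([X0, X1]) + 0.01 * rng.normal(size=(2 * n, d))
    y = np.repeat([0, 1], n)
    flipped = np.zeros(2 * n, dtype=bool)
    flipped[:5] = True
    y[flipped] = 1
    return reconstruction_error_ratios(X, y, k=2), flipped

if __name__ == "__main__":
    rer, flipped = demo()
    print("median RER, flipped labels:", np.median(rer[flipped]))
    print("median RER, clean labels:  ", np.median(rer[~flipped]))
```

Under these assumptions, a ratio well above 1 means a sample is reconstructed better by another class's model than by its labeled class's model, which is the sense in which a difficulty score could double as a label mistakenness score.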
