Quantifying and mitigating the impact of label errors on model disparity metrics
Julius Adebayo, Melissa Hall, Bowen Yu, Bobbie Chern
Errors in labels obtained via human annotation adversely affect a model's performance. Existing approaches propose ways to mitigate the effect of label error on a model's downstream accuracy, yet little is known about its impact on a model's disparity metrics. We empirically characterize how varying levels of label error, in both training and test data, affect these disparity metrics. We find that group calibration and other metrics are sensitive to train-time and test-time label error, particularly for minority groups. This disparate effect persists even for models trained with noise-aware algorithms. To mitigate the impact of training-time label error, we present an approach to estimate the influence of a training input's label on a model's group disparity metric. We empirically assess the proposed approach on a variety of datasets and find that it significantly outperforms alternative approaches at identifying training inputs that improve a model's disparity metric. We complement the approach with an automatic relabel-and-finetune scheme that produces updated models with, provably, improved group calibration error.

Label error (noise), i.e., mistakes associated with the label assigned to a data point, is a pervasive problem in machine learning (Northcutt et al., 2021). For example, 30 percent of a random sample of 1000 examples from the Google Emotions dataset (Demszky et al., 2020) had label errors (Chen, 2022). Similarly, an analysis of the MS COCO dataset found that up to 37 percent (273,834 errors) of all annotations are erroneous (Murdoch, 2022). Yet, little is known about the effect of label error on a model's group-based disparity metrics such as equalized odds (Hardt et al., 2016), group calibration (Pleiss et al., 2017), and false positive rate (Barocas et al., 2019).

It is now common practice to conduct 'fairness' audits of a model's predictions (see Buolamwini and Gebru, 2018; Raji and Buolamwini, 2019; Bakalar et al., 2021) to identify data subgroups on which the model underperforms. Label error in the test data used to conduct a fairness audit renders the results unreliable. Similarly, label error in the training data, especially if the error is systematically more prevalent in certain groups, can lead to models that associate erroneous labels with those groups. The reliability of a fairness audit rests on the assumption that labels are accurate; yet, the sensitivity of a model's disparity metrics to label error is still poorly understood. To this end, we ask: what is the effect of label error on a model's disparity metrics? We address this high-level question via the following research questions:

1. Research Question 1: What is the sensitivity of a model's disparity metric to label errors in training and test data? Does the effect of label error vary with group size? (See the sketch after this list.)
2. Research Question 2: How can a practitioner identify training points whose labels have the most influence on a model's group disparity metric?
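Research Question 1 amounts to measuring how a group disparity metric, such as group calibration error, shifts as labels are corrupted at increasing rates. The sketch below is a minimal illustration of that setup under stated assumptions, not the paper's implementation: it assumes binary labels, symmetric (random) label flips, and hypothetical helper names (`expected_calibration_error`, `group_calibration_gap`, `flip_labels`); the data in the usage snippet is synthetic.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary expected calibration error over equal-width confidence bins."""
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # Weight each bin's |mean confidence - empirical positive rate| by its size.
        ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return ece

def group_calibration_gap(probs, labels, groups):
    """Largest gap in calibration error between any two groups (one possible disparity metric)."""
    per_group = [
        expected_calibration_error(probs[groups == g], labels[groups == g])
        for g in np.unique(groups)
    ]
    return max(per_group) - min(per_group)

def flip_labels(labels, noise_rate, rng):
    """Simulate symmetric label error: flip each binary label with probability noise_rate."""
    flips = rng.random(labels.shape) < noise_rate
    return np.where(flips, 1 - labels, labels)

# Toy audit: track the disparity metric as test labels are corrupted at increasing rates.
rng = np.random.default_rng(0)
n = 5000
groups = (rng.random(n) < 0.2).astype(int)   # group 1 is a ~20% minority (synthetic)
labels = rng.integers(0, 2, size=n)
probs = np.clip(labels * 0.7 + rng.normal(0.15, 0.2, size=n), 0.0, 1.0)  # toy model scores

for rate in [0.0, 0.1, 0.2, 0.3]:
    noisy = flip_labels(labels, rate, rng)
    print(f"noise rate {rate:.1f}: group calibration gap = "
          f"{group_calibration_gap(probs, noisy, groups):.4f}")
```

In practice one would replace the synthetic scores with a trained model's predictions and repeat the corruption at several rates per group to see whether smaller groups are disproportionately affected.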
Oct-3-2023