mislabeled example
Calibration improves detection of mislabeled examples
Chibane, Ilies, George, Thomas, Nodet, Pierre, Lemaire, Vincent
Mislabeled data is a pervasive issue that undermines the performance of machine learning systems in real-world applications. An effective approach to mitigate this problem is to detect mislabeled instances and subject them to special treatment, such as filtering or relabeling. Automatic mislabeling detection methods typically rely on training a base machine learning model and then probing it for each instance to obtain a trust score indicating whether the provided label is genuine or incorrect. The properties of this base model are thus of paramount importance. In this paper, we investigate the impact of calibrating this model. Our empirical results show that using calibration methods improves the accuracy and robustness of mislabeled instance detection, providing a practical and effective solution for industrial applications.
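The pipeline this abstract describes (train a base model, probe it per instance for a trust score, calibrate the model first) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method. The abstract does not name a calibration technique, so temperature scaling is used here as a stand-in. The trust score is assumed to be the calibrated probability the model assigns to the provided label. The function names and the toy logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, dividing by a temperature first."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(all_logits, labels, temperature):
    """Mean negative log-likelihood of the labels at a given temperature."""
    losses = [
        -math.log(max(softmax(z, temperature)[y], 1e-12))
        for z, y in zip(all_logits, labels)
    ]
    return sum(losses) / len(losses)

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the temperature that minimizes NLL on held-out data.

    Assumption: a coarse grid search is enough for illustration; a real
    setup would typically use a gradient-based optimizer.
    """
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.5, ..., 10.0
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))

def trust_scores(all_logits, given_labels, temperature=1.0):
    """Trust score = calibrated probability of the label that was provided."""
    return [softmax(z, temperature)[y] for z, y in zip(all_logits, given_labels)]

# Hypothetical logits from a base model on four two-class instances.
# The third instance carries label 1 although the model favors class 0.
logits = [[4.0, 0.0], [0.0, 4.0], [3.0, 0.0], [0.5, 0.0]]
labels = [0, 1, 1, 0]

scores = trust_scores(logits, labels, temperature=2.0)
most_suspicious = min(range(len(scores)), key=scores.__getitem__)
```

Instances with the lowest trust scores are the candidates for filtering or relabeling. Calibration changes the scale of the scores, which matters when they are compared against a fixed threshold.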
Characterizing Datapoints via Second-Split Forgetting: Supplementary Material. A Theoretical Results. A.1 Preliminaries. Let w ∈ R
We use the sample complexity required to estimate the distribution as a proxy for the complexity of the distribution. We make these assumptions to simplify the theoretical exposition. However, our results can be observed even after relaxing them, at the expense of more bookkeeping. Based on Chatterji and Long [9], we make the following assumptions about the problem setup: (A.1) The labels are reversed for mislabeled examples.
An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets
Srikanth, Maya, Irvin, Jeremy, Hill, Brian Wesley, Godoy, Felipe, Sabane, Ishan, Ng, Andrew Y.
Major advancements in computer vision can primarily be attributed to the use of labeled datasets. However, acquiring labels for datasets often results in errors which can harm model performance. Recent works have proposed methods to automatically identify mislabeled images, but developing strategies to effectively implement them in real-world datasets has been sparsely explored. Towards improved data-centric methods for cleaning real-world vision datasets, we first conduct more than 200 experiments carefully benchmarking recently developed automated mislabel detection methods on multiple datasets under a variety of synthetic and real noise settings with varying noise levels. We compare these methods to a Simple and Efficient Mislabel Detector (SEMD) that we craft, and find that SEMD performs similarly to or outperforms prior mislabel detection approaches. We then apply SEMD to multiple real-world computer vision datasets and test how dataset size, mislabel removal strategy, and mislabel removal amount further affect model performance after retraining on the cleaned data. With careful design of the approach, we find that mislabel removal leads to per-class performance improvements of up to 8% for a retrained classifier in smaller data regimes.
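The "mislabel removal amount" studied here can be illustrated with a short sketch. This is not SEMD, which the abstract does not specify. It only assumes that some detector has already produced a per-example trust score (higher means the label is more likely correct) and drops a chosen fraction of the least trusted examples before retraining. The function name and the toy data are hypothetical.

```python
def remove_suspected_mislabels(examples, trust, removal_fraction):
    """Drop the `removal_fraction` of examples with the lowest trust scores.

    Returns the kept examples (original order preserved) and the sorted
    indices of the examples that were flagged for removal.
    """
    if len(examples) != len(trust):
        raise ValueError("examples and trust must have the same length")
    if not 0.0 <= removal_fraction <= 1.0:
        raise ValueError("removal_fraction must be in [0, 1]")

    n_remove = int(len(examples) * removal_fraction)
    # Rank indices from least to most trusted and flag the first n_remove.
    order = sorted(range(len(examples)), key=lambda i: trust[i])
    flagged = set(order[:n_remove])

    kept = [ex for i, ex in enumerate(examples) if i not in flagged]
    return kept, sorted(flagged)

# Hypothetical dataset of five items with detector-provided trust scores.
examples = ["a", "b", "c", "d", "e"]
trust = [0.9, 0.1, 0.8, 0.2, 0.7]
kept, flagged = remove_suspected_mislabels(examples, trust, removal_fraction=0.4)
```

Sweeping `removal_fraction` and retraining on `kept` each time is one simple way to study how the removal amount affects downstream performance.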