Training on Plausible Counterfactuals Removes Spurious Correlations

Sadiku, Shpresim, Chitranshi, Kartikeya, Kera, Hiroshi, Pokutta, Sebastian

arXiv.org Artificial Intelligence 

Plausible counterfactual explanations (p-CFEs) are perturbations that minimally modify inputs to change classifier decisions while remaining plausible under the data distribution. In this study, we demonstrate that classifiers can be trained on p-CFEs labeled with induced \emph{incorrect} target classes to classify unperturbed inputs with the original labels. While previous studies have shown that such learning is possible with adversarial perturbations, we extend this paradigm to p-CFEs. Interestingly, our experiments reveal that learning from p-CFEs is even more effective: the resulting classifiers achieve not only high in-distribution accuracy but also exhibit significantly reduced bias with respect to spurious correlations.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found