Reckoning with the Disagreement Problem: Explanation Consensus as a Training Objective
Schwarzschild, Avi, Cembalest, Max, Rao, Karthik, Hines, Keegan, Dickerson, John
arXiv.org Artificial Intelligence
As neural networks increasingly make critical decisions in high-stakes settings, monitoring and explaining their behavior in an understandable and trustworthy manner is a necessity. One commonly used type of explainer is post hoc feature attribution, a family of methods that give each feature in an input a score corresponding to its influence on a model's output. A major limitation of this family of explainers in practice is that they can disagree on which features are more important than others. Our contribution in this paper is a method of training models with this disagreement problem in mind. We do this by introducing, alongside the standard loss term corresponding to accuracy, a Post hoc Explainer Agreement Regularization (PEAR) loss term that measures the difference in feature attribution between a pair of explainers. We observe on three datasets that a model trained with this loss term shows improved explanation consensus on unseen data, and that consensus also improves between explainers other than those used in the loss term. Finally, we examine the trade-off between improved consensus and model performance, and we study the influence our method has on feature attribution explanations.
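The idea of pairing an accuracy term with an explainer-disagreement term can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the model is a toy logistic regression, the two explainers are plain gradient and input-times-gradient attribution, disagreement is measured as one minus the Pearson correlation of the two attribution vectors, and the weight `lam` and all function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Toy logistic-regression model standing in for a neural network.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def grad_attribution(w, b, x):
    # Explainer A (assumed): gradient of the output with respect to each input.
    p = predict(w, b, x)
    return [p * (1.0 - p) * wi for wi in w]

def input_x_grad_attribution(w, b, x):
    # Explainer B (assumed): input multiplied elementwise by the gradient.
    return [xi * gi for xi, gi in zip(x, grad_attribution(w, b, x))]

def pearson(a, c):
    # Pearson correlation of two equal-length vectors; 0.0 if either is constant.
    n = len(a)
    ma, mc = sum(a) / n, sum(c) / n
    cov = sum((ai - ma) * (ci - mc) for ai, ci in zip(a, c))
    va = sum((ai - ma) ** 2 for ai in a)
    vc = sum((ci - mc) ** 2 for ci in c)
    denom = math.sqrt(va * vc)
    return cov / denom if denom > 0 else 0.0

def combined_loss(w, b, x, y, lam):
    # Task term: binary cross-entropy for label y in {0, 1}.
    p = predict(w, b, x)
    eps = 1e-12
    task = -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    # Disagreement term (assumed form): 1 - correlation between the two explainers.
    disagreement = 1.0 - pearson(grad_attribution(w, b, x),
                                 input_x_grad_attribution(w, b, x))
    # Assumed weighting: task loss plus lam times disagreement.
    return task + lam * disagreement, task, disagreement

if __name__ == "__main__":
    total, task, dis = combined_loss([1.0, -2.0, 0.5], 0.0, [1.0, -1.0, 2.0], 1, 0.5)
    print(total, task, dis)
```

In this sketch, larger values of `lam` put more weight on explainer agreement relative to the task term, which mirrors the trade-off between consensus and model performance mentioned in the abstract. Training with such a term would additionally require differentiating through the attributions, which this sketch does not do.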
Mar-23-2023
- Country:
- North America > United States
- California (0.05)
- New York
- Richmond County > New York City (0.04)
- Queens County > New York City (0.04)
- New York County > New York City (0.04)
- Kings County > New York City (0.04)
- Bronx County > New York City (0.04)
- Maryland > Prince George's County
- College Park (0.04)
- Genre:
- Research Report (1.00)
- Technology: