Reviews: A Benchmark for Interpretability Methods in Deep Neural Networks

Neural Information Processing Systems 

Summary --- This paper proposes to evaluate saliency/importance visual explanations by removing the "important" pixels from images and measuring whether a retrained classifier can still classify those images correctly. Many explanation methods fail to remove class-relevant information under this test, but some ensembling techniques succeed, effectively removing entire objects; these are judged to be better explanations. The paper takes the view that important information is precisely the information a classifier can use to predict the correct label. Under this view, one can score an importance estimate by how much classification performance drops when the important pixels are removed from all images in both the training and validation sets.
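The removal step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes grayscale images and replaces each image's top-fraction most-important pixels with that image's mean value (one common ablation choice); the function name and array shapes are my own.

```python
import numpy as np

def remove_top_pixels(images, importance, fraction):
    """Replace the top `fraction` most-important pixels of each image
    with that image's mean value (one common ablation choice).

    images:     (N, H, W) float array of images
    importance: (N, H, W) importance estimates, higher = more important
    fraction:   fraction of pixels to remove per image
    """
    out = images.copy()
    n, h, w = images.shape
    k = int(fraction * h * w)
    if k == 0:
        return out
    for i in range(n):
        flat_importance = importance[i].ravel()
        # indices of the k most-important pixels in this image
        idx = np.argpartition(flat_importance, -k)[-k:]
        out_view = out[i].ravel()  # view into `out`, so writes stick
        out_view[idx] = images[i].mean()
    return out
```

In the full benchmark, both the train and validation sets would be modified this way, a fresh classifier would be trained on the modified train set, and its validation accuracy compared against a classifier trained on unmodified data; a larger accuracy drop indicates a better importance estimate.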