Generative causal explanations of black-box classifiers
We develop a method for generating causal post-hoc explanations of black-box classifiers based on a learned low-dimensional representation of the data. The explanation is causal in the sense that changing learned latent factors produces a change in the classifier output statistics. To construct these explanations, we design a learning framework that leverages a generative model and information-theoretic measures of causal influence. Our objective function encourages both the generative model to faithfully represent the data distribution and the latent factors to have a large causal influence on the classifier output. Our method learns both global and local explanations, is compatible with any classifier that admits class probabilities and a gradient, and does not require labeled attributes or knowledge of causal structure. Using carefully controlled test cases, we provide intuition that illuminates the function of our causal objective. We then demonstrate the practical utility of our method on image recognition tasks.
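In schematic form, the objective described above trades off causal influence against data fidelity; the notation below is an illustrative sketch rather than a quotation of the paper's equation:

\[
  \max_{\theta}\;\; \mathcal{C}\big(\alpha;\hat{Y}\big) \;+\; \lambda\,\mathcal{D}\big(X,\, g_\theta(\alpha,\beta)\big),
\]

where \(g_\theta\) is the generative model, \(\alpha\) are the latent factors intended to causally influence the classifier output \(\hat{Y}\), \(\beta\) are the remaining latent factors, \(\mathcal{C}\) is an information-theoretic measure of causal influence, \(\mathcal{D}\) measures how faithfully the generative model represents the data distribution, and \(\lambda\) balances the two terms.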
Review for NeurIPS paper: Generative causal explanations of black-box classifiers
Clarity: The paper is very well written and is a pleasure to read. I have, however, a few concerns about the choice of the overall paper structure and certain presentation decisions. I should mention up front that these concerns are highly subjective and may be partly a matter of taste. As such, they did not strongly impact my assessment, but I want to state them to (a) let the authors know that there may be an issue and (b) communicate these considerations to the other reviewers. The first concern is the authors' decision to center the paper's presentation on the theme of causal explanations, while in the model the authors consider, the causal part is equivalent to maximizing the mutual information between a subset of latent features and the classifier decision.
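To make the equivalence the reviewer points to concrete: when the latent factors have no causal parents, intervening on them is the same as conditioning on them, so the information-flow measure reduces to ordinary mutual information. In illustrative notation (the symbols here are assumptions, not quoted from the paper),

\[
  \mathcal{C}(\alpha \to \hat{Y}) \;=\; I(\alpha;\hat{Y}) \;=\; \mathbb{E}_{\alpha}\!\left[ D_{\mathrm{KL}}\!\left( p(\hat{Y}\mid\alpha)\,\|\,p(\hat{Y}) \right) \right],
\]

where \(\alpha\) is the subset of latent factors and \(\hat{Y}\) is the classifier decision.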
Review for NeurIPS paper: Generative causal explanations of black-box classifiers
This paper presents a generative model to "explain" any given black-box classifier and its training dataset. The explanation is given through hidden factors that can control, via intervention, the output of the classifier. These factors are discovered by optimizing an objective with two terms: (1) a proposed information-flow measure that captures the causal effect of the hidden factors on the classifier output, and (2) a distribution-similarity term that requires the discovered hidden factors to generate the feature space back. The reviewers found this a borderline paper; after the discussion phase, all reviewers leaned towards acceptance. They pointed out as strengths that the paper is very well written and presents a simple yet effective method with extensive ablative experiments.
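For a concrete picture of the first term, below is a minimal Monte Carlo sketch of estimating the information flow from the hidden factors to the classifier output. The `decoder` and `classifier` callables and the latent dimensions are assumed interfaces for illustration, not the paper's implementation.

```python
import numpy as np

def information_flow(decoder, classifier, n_alpha=64, n_beta=64,
                     alpha_dim=2, beta_dim=6, eps=1e-12):
    """Monte Carlo estimate of the information flow from the causal
    latent factors alpha to the classifier output.

    Because alpha has no causal parents, intervening on it equals
    conditioning on it, so the information flow reduces to the mutual
    information I(alpha; Y) = E_alpha[ KL(p(Y|alpha) || p(Y)) ].

    Assumed interfaces (illustrative, not the paper's code):
    decoder(alpha, beta) -> data point x
    classifier(x)        -> vector of class probabilities
    """
    cond = []  # p(Y | alpha_i), one row per sampled alpha_i
    for _ in range(n_alpha):
        alpha = np.random.randn(alpha_dim)
        probs = [classifier(decoder(alpha, np.random.randn(beta_dim)))
                 for _ in range(n_beta)]
        cond.append(np.mean(probs, axis=0))   # marginalize over beta
    cond = np.asarray(cond)
    marg = cond.mean(axis=0)                  # p(Y), marginalized over alpha
    kl = np.sum(cond * (np.log(cond + eps) - np.log(marg + eps)), axis=1)
    return kl.mean()
```

In a full training loop, an estimate of this kind would be maximized jointly with the distribution-similarity (reconstruction) term, with gradients taken through a differentiable classifier and decoder.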