faithful explanation
EXP-CAM: Explanation Generation and Circuit Discovery Using Classifier Activation Matching
Suhail, Pirzada, Anand, Aditya, Sethi, Amit
Machine learning models, by virtue of training, learn a large repertoire of decision rules for any given input, and any one of these may suffice to justify a prediction. However, in high-dimensional input spaces, such rules are difficult to identify and interpret. In this paper, we introduce EXP-CAM: an explanation generation and circuit discovery approach using Classifier Activation Matching. EXP-CAM generates minimal and faithful explanations for the decisions of pre-trained image classifiers that not only preserve the model's decision but are also concise and human-readable. To achieve this, we train a lightweight auto-encoder to produce binary masks that highlight the decision-critical regions of an image while discarding irrelevant background. The training objective integrates activation alignment across multiple layers, consistency at the output label, priors that encourage sparsity and compactness, and a robustness constraint that enforces faithfulness. The minimal explanations so generated also lead us toward a mechanistic interpretation of the model internals. To this end, we also introduce a circuit readout procedure: using the explanation's forward pass and gradients, we identify active channels and construct a channel-level graph, scoring inter-layer edges by ingress weight magnitude times source activation and feature-to-class links by classifier weight magnitude times feature activation. Together, these contributions provide a practical bridge between minimal input-level explanations and a mechanistic understanding of the internal computations driving model decisions.
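The edge-scoring rule named in this abstract (ingress weight magnitude times source activation) can be illustrated with a minimal sketch. It assumes a single dense layer with non-negative (for example post-ReLU) source activations; the function names `edge_scores` and `top_edges` and the toy numbers are hypothetical, and the abstract does not say how weights are aggregated for convolutional kernels or how active channels are selected.

```python
from typing import List, Tuple


def edge_scores(weights: List[List[float]], source_acts: List[float]) -> List[List[float]]:
    """Score every edge i -> j as |W[j][i]| * a[i].

    weights[j][i] is the weight from source channel i into target channel j.
    source_acts[i] is the activation of source channel i (assumed >= 0).
    """
    return [[abs(w) * a for w, a in zip(row, source_acts)] for row in weights]


def top_edges(scores: List[List[float]], k: int) -> List[Tuple[int, int, float]]:
    """Return the k highest-scoring edges as (source, target, score) triples."""
    flat = [(s, i, j) for j, row in enumerate(scores) for i, s in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)
    return [(i, j, s) for s, i, j in flat[:k]]


# Toy layer: 3 source channels feeding 2 target channels (illustrative values).
W = [[0.5, -2.0, 0.0],
     [1.5, 0.1, -3.0]]
acts = [1.0, 0.5, 0.0]  # the third source channel is inactive

scores = edge_scores(W, acts)
strongest = top_edges(scores, 2)
```

In this toy case the largest weight (-3.0) contributes nothing because its source channel is inactive, which is the point of multiplying by the activation. The same product, applied to the final classifier weights and the feature activations, would give the feature-to-class link scores the abstract mentions.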
Activation Matching for Explanation Generation
Suhail, Pirzada, Anand, Aditya, Sethi, Amit
In this paper we introduce an activation-matching-based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image $x$ and a frozen model $f$, we train a lightweight autoencoder to output a binary mask $m$ such that the explanation $e = m \odot x$ preserves both the model's prediction and the intermediate activations of $x$. Our objective combines: (i) multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors: L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision making of the underlying model.
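The three mask priors listed in item (ii) can be sketched in a few lines. The abstract names the terms but not their exact form, so the choices here are assumptions: mean absolute value for the L1 area, the mean of $m(1-m)$ for the binarization penalty, anisotropic (absolute-difference) total variation, and unit weights in the hypothetical `mask_prior` helper.

```python
from typing import List

Mask = List[List[float]]  # mask values assumed to lie in [0, 1]


def l1_area(mask: Mask) -> float:
    """Mean absolute mask value; smaller means a more minimal explanation."""
    n = sum(len(row) for row in mask)
    return sum(abs(v) for row in mask for v in row) / n


def binarization_penalty(mask: Mask) -> float:
    """Mean of m * (1 - m); zero exactly when every entry is 0 or 1."""
    n = sum(len(row) for row in mask)
    return sum(v * (1.0 - v) for row in mask for v in row) / n


def total_variation(mask: Mask) -> float:
    """Sum of absolute differences between horizontal and vertical neighbours."""
    h, w = len(mask), len(mask[0])
    tv = 0.0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                tv += abs(mask[y][x + 1] - mask[y][x])
            if y + 1 < h:
                tv += abs(mask[y + 1][x] - mask[y][x])
    return tv


def mask_prior(mask: Mask, lam_area: float = 1.0, lam_bin: float = 1.0,
               lam_tv: float = 1.0) -> float:
    """Weighted sum of the three priors (the weights are illustrative)."""
    return (lam_area * l1_area(mask)
            + lam_bin * binarization_penalty(mask)
            + lam_tv * total_variation(mask))


# A crisp 2x2 block in the corner of a 3x3 mask.
block = [[0.0, 0.0, 0.0],
         [0.0, 1.0, 1.0],
         [0.0, 1.0, 1.0]]
```

For `block`, the binarization penalty is zero because every entry is 0 or 1, and the total variation counts only the four boundary crossings of the block. Scattering the same four pixels across the mask would keep the area fixed but raise the total variation, which is how that term favours compact regions.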
Models That Are Interpretable But Not Transparent
Zhong, Chudi, Chen, Panyu, Rudin, Cynthia
Faithful explanations are essential for machine learning models in high-stakes applications. Inherently interpretable models are well-suited for these applications because they naturally provide faithful explanations by revealing their decision logic. However, model designers often need to keep these models proprietary to maintain their value. This creates a tension: we need models that are interpretable, allowing human decision-makers to understand and justify predictions, but not transparent, so that the model's decision boundary is not easily replicated by attackers. Shielding the model's decision boundary is particularly challenging alongside the requirement of completely faithful explanations, since such explanations reveal the true logic of the model for an entire subspace around each query point. This work provides an approach, FaithfulDefense, that creates model explanations for logical models that are completely faithful, yet reveal as little as possible about the decision boundary. FaithfulDefense is based on a maximum set cover formulation, and we provide multiple formulations for it, taking advantage of submodularity.
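The abstract does not spell out how FaithfulDefense maps explanations to sets or which direction its objective takes, so the following is not the paper's algorithm. It is only a sketch of the generic greedy routine for maximum coverage, the standard way submodularity is exploited in coverage problems: coverage is monotone and submodular, so picking the set with the largest marginal gain at each step achieves at least a (1 - 1/e) fraction of the optimal coverage. The name `greedy_max_coverage` and the toy candidate sets are hypothetical.

```python
from typing import Dict, Hashable, List, Set, Tuple


def greedy_max_coverage(candidates: Dict[str, Set[Hashable]],
                        budget: int) -> Tuple[List[str], Set[Hashable]]:
    """Pick up to `budget` sets, each time taking the largest marginal gain."""
    remaining = dict(candidates)
    covered: Set[Hashable] = set()
    chosen: List[str] = []
    for _ in range(budget):
        if not remaining:
            break
        # Marginal gain = number of elements not yet covered.
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no candidate adds anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Toy instance (illustrative only).
sets = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {4, 5, 6, 7},
    "D": {1, 7},
}
picked, covered = greedy_max_coverage(sets, budget=2)
```

Here the routine first takes `C` (four new elements) and then `A` (three new elements), covering all seven elements with two sets.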
AudioGenX: Explainability on Text-to-Audio Generative Models
Kang, Hyunju, Han, Geonhee, Jeong, Yoonjae, Park, Hogun
Text-to-audio generation models (TAG) have achieved significant advances in generating audio conditioned on text descriptions. However, a critical challenge lies in the lack of transparency regarding how each textual input impacts the generated audio. To address this issue, we introduce AudioGenX, an Explainable AI (XAI) method that provides explanations for text-to-audio generation models by highlighting the importance of input tokens. AudioGenX optimizes an Explainer by leveraging factual and counterfactual objective functions to provide faithful explanations at the audio token level. This method offers a detailed and comprehensive understanding of the relationship between text inputs and audio outputs, enhancing both the explainability and trustworthiness of TAG models. Extensive experiments demonstrate the effectiveness of AudioGenX in producing faithful explanations, benchmarked against existing methods using novel evaluation metrics specifically designed for audio generation tasks.
A Sim2Real Approach for Identifying Task-Relevant Properties in Interpretable Machine Learning
Nofshin, Eura, Brown, Esther, Lim, Brian, Pan, Weiwei, Doshi-Velez, Finale
In the context of human+AI interaction, explanations of the underlying function can provide additional information to assist the human in performing their task. Recent literature suggests that explanations with different properties are useful for different tasks [Liao et al., 2022, Lai et al., 2023, Chen et al., 2023, Jesus et al., 2021, Wang et al., 2019, Liao et al., 2020, Lim and Dey, 2009]. For example, in an AI-auditing task, the user may need to check whether the AI inappropriately relied on a forbidden feature, such as using gender in computing a credit score [Kaur et al., 2020, Hase and Bansal, 2020a, Lakkaraju et al., 2019]. In this case, we would want explanations that are faithful; that is, they reliably capture the underlying behavior of the function. On the other hand, suppose our goal is to help a user quickly understand the process by which a function produces its output; we can quantify the user's understanding by measuring the user's ability to approximate the function's output, given the input and an explanation [Hase and Bansal, 2020b, Chandrasekaran et al., 2018]. In this case, we may want explanations with low complexity, so that the user can effectively reason using the explanation in a limited amount of time.
Interpretability Needs a New Paradigm
Madsen, Andreas, Lakkaraju, Himabindu, Reddy, Siva, Chandar, Sarath
Interpretability is the study of explaining models in understandable terms to humans. At present, interpretability is divided into two paradigms: the intrinsic paradigm, which believes that only models designed to be explained can be explained, and the post-hoc paradigm, which believes that black-box models can be explained. At the core of this debate is how each paradigm ensures its explanations are faithful, i.e., true to the model's behavior. This is important, as false but convincing explanations lead to unsupported confidence in artificial intelligence (AI), which can be dangerous. This paper's position is that we should think about new paradigms while staying vigilant regarding faithfulness. First, by examining the history of paradigms in science, we see that paradigms are constantly evolving. Then, by examining the current paradigms, we can understand their underlying beliefs, the value they bring, and their limitations. Finally, this paper presents 3 emerging paradigms for interpretability. The first paradigm designs models such that faithfulness can be easily measured. Another optimizes models such that explanations become faithful. The last paradigm proposes to develop models that produce both a prediction and an explanation.
Can LLMs Produce Faithful Explanations For Fact-checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate
Kim, Kyungha, Lee, Sangyun, Huang, Kung-Hsiang, Chan, Hou Pong, Li, Manling, Ji, Heng
Fact-checking research has extensively explored verification but less so the generation of natural-language explanations, crucial for user trust. While Large Language Models (LLMs) excel in text generation, their capability for producing faithful explanations in fact-checking remains underexamined. Our study investigates LLMs' ability to generate such explanations, finding that zero-shot prompts often result in unfaithfulness. To address these challenges, we propose the Multi-Agent Debate Refinement (MADR) framework, leveraging multiple LLMs as agents with diverse roles in an iterative refining process aimed at enhancing faithfulness in generated explanations. MADR ensures that the final explanation undergoes rigorous validation, significantly reducing the likelihood of unfaithful elements and aligning closely with the provided evidence. Experimental results demonstrate that MADR significantly improves the faithfulness of LLM-generated explanations to the evidence, advancing the credibility and trustworthiness of these explanations.
Large Language Models As Faithful Explainers
Chuang, Yu-Neng, Wang, Guanchu, Chang, Chia-Yuan, Tang, Ruixiang, Yang, Fan, Du, Mengnan, Cai, Xuanting, Hu, Xia
Large Language Models (LLMs) have recently become proficient in addressing complex tasks by utilizing their rich internal knowledge and reasoning ability. Consequently, this complexity hinders traditional input-focused explanation algorithms for explaining the complex decision-making processes of LLMs. Recent advancements have thus emerged for self-explaining their predictions through a single feed-forward inference in a natural language format. However, natural language explanations are often criticized for lack of faithfulness since these explanations may not accurately reflect the decision-making behaviors of the LLMs. In this work, we introduce a generative explanation framework, xLLM, to improve the faithfulness of the explanations provided in natural language formats for LLMs. Specifically, we propose an evaluator to quantify the faithfulness of natural language explanation and enhance the faithfulness by an iterative optimization process of xLLM, with the goal of maximizing the faithfulness scores. Experiments conducted on three NLU datasets demonstrate that xLLM can significantly improve the faithfulness of generated explanations, which are in alignment with the behaviors of LLMs.
Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
Agarwal, Chirag, Tanneru, Sree Harsha, Lakkaraju, Himabindu
Large Language Models (LLMs) are deployed as powerful tools for several natural language processing (NLP) applications. Recent works show that modern LLMs can generate self-explanations (SEs), which elicit their intermediate reasoning steps for explaining their behavior. Self-explanations have seen widespread adoption owing to their conversational and plausible nature. However, there is little to no understanding of their faithfulness. In this work, we discuss the dichotomy between faithfulness and plausibility in SEs generated by LLMs. We argue that while LLMs are adept at generating plausible explanations -- seemingly logical and coherent to human users -- these explanations do not necessarily align with the reasoning processes of the LLMs, raising concerns about their faithfulness. We highlight that the current trend towards increasing the plausibility of explanations, primarily driven by the demand for user-friendly interfaces, may come at the cost of diminishing their faithfulness. We assert that the faithfulness of explanations is critical in LLMs employed for high-stakes decision-making. Moreover, we urge the community to identify the faithfulness requirements of real-world applications and ensure explanations meet those needs. Finally, we propose some directions for future work, emphasizing the need for novel methodologies and frameworks that can enhance the faithfulness of self-explanations without compromising their plausibility, essential for the transparent deployment of LLMs in diverse high-stakes domains.
How AlphaZero Learns Chess?
DeepMind and Google Brain researchers and former World Chess Champion Vladimir Kramnik explore how human knowledge is acquired and how chess concepts are represented in the AlphaZero neural network via concept probing, behavioral analysis, and an examination of its activations. The world has quietly crowned a new chess champion. While it has now been over two decades since a human has been honored with that title, the latest victor represents a breakthrough in another significant way: It's an algorithm that can be generalized to other learning tasks. AlphaZero, the new reigning champion, acquired all its chess know-how in a mere four hours. AlphaZero is almost as different from its fellow AI chess competitors as Deep Blue was from Garry Kasparov, back when the latter first faced off against a supercomputer in 1996.