Decompose the model: Mechanistic interpretability in image models with Generalized Integrated Gradients (GIG)

Kim, Yearim, Han, Sangyu, Han, Sangbum, Kwak, Nojun

Sep-3-2024–arXiv.org Artificial Intelligence

In eXplainable AI (XAI) for language models, the progression from local explanations of individual decisions to global explanations built on high-level concepts has laid the groundwork for mechanistic interpretability, which aims to decode the exact operations a model performs. This paradigm has not been adequately explored in image models, where existing methods have focused primarily on class-specific interpretations. This paper introduces a novel approach that systematically traces the entire pathway from the input, through all intermediate layers, to the final output, across the whole dataset. We use Pointwise Feature Vectors (PFVs) and Effective Receptive Fields (ERFs) to decompose model embeddings into interpretable Concept Vectors. We then calculate the relevance between concept vectors with our Generalized Integrated Gradients (GIG), enabling a comprehensive, dataset-wide analysis of model behavior.

In the field of eXplainable AI (XAI), efforts have historically transitioned from local explanation to global explanation to mechanistic interpretability. Local explanation methods (Selvaraju et al., 2016; Montavon et al., 2017; Sundararajan et al., 2017; Han et al., 2024) focus on explaining specific decisions for individual instances, whereas global explanation methods seek to uncover overall patterns and behaviors that apply across the entire dataset (Wu et al., 2022; Xuanyuan et al., 2023; Singh et al., 2024).
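For orientation, the following is a minimal sketch of standard Integrated Gradients (Sundararajan et al., 2017), the local attribution method cited above, and not the paper's GIG, whose formulation is not given in this summary. The toy quadratic model, its weights, the zero baseline, and all function names are assumptions chosen for illustration. Attributions are the input-minus-baseline difference multiplied by the gradient averaged along the straight-line path from baseline to input, approximated here with a midpoint Riemann sum.

```python
from typing import Callable, List

Vector = List[float]


def integrated_gradients(
    grad_fn: Callable[[Vector], Vector],
    x: Vector,
    baseline: Vector,
    steps: int = 50,
) -> Vector:
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            total[i] += g
    # Average gradient along the path, scaled by (input - baseline).
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical toy model (an assumption): f(x) = sum_i w_i * x_i^2.
WEIGHTS = [1.0, -2.0, 0.5]


def toy_model(x: Vector) -> float:
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))


def toy_grad(x: Vector) -> Vector:
    # Analytic gradient of the toy model: df/dx_i = 2 * w_i * x_i.
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


x = [1.0, 2.0, -3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)

# Completeness: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(sum(attributions) - (toy_model(x) - toy_model(baseline)))
```

For this toy model, the attribution of each input coordinate works out to `w_i * x_i**2`, and `completeness_gap` is zero up to floating-point error. In a real image model the gradient would come from automatic differentiation rather than a hand-written `toy_grad`.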

attribution, concept vector, explanation


