Attributions All the Way Down? The Metagame of Interpretability

Baniecki, Hubert, Biecek, Przemyslaw, Fumagalli, Fabian

May-8-2026–arXiv.org Machine Learning

We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\phi(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
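The idea of treating the attribution itself as a cooperative game can be illustrated with a minimal sketch. The abstract does not spell out the exact value function of the metagame, so the construction below is an assumption: the first-order attribution is taken to be the exact Shapley value of a toy set function `v`, and the metagame for a target feature $i$ assigns to each coalition $S$ the attribution of $i$ when only features in $S$ are allowed to influence the model. The names `shapley`, `meta_attribution`, and `v` are hypothetical and chosen for illustration only.

```python
from itertools import combinations
from math import factorial


def shapley(value, players):
    """Exact Shapley values of a set function `value` over `players`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def meta_attribution(value, players, target):
    """Shapley values of the (assumed) metagame for feature `target`.

    Assumption: the metagame value of a coalition is the first-order
    attribution of `target` in the game restricted to that coalition.
    """
    def metagame(coalition):
        def restricted(t):
            # Features outside `coalition` cannot affect the model output
            return value(frozenset(t) & coalition)
        return shapley(restricted, players)[target]

    return shapley(metagame, players)


# Toy model (hypothetical): two main effects and one 0-2 interaction
def v(s):
    return 2.0 * (0 in s) + 1.0 * (1 in s) + 4.0 * (0 in s and 2 in s)


players = [0, 1, 2]
phi = shapley(v, players)                  # first-order attributions
meta = meta_attribution(v, players, 0)     # influence of each j on phi[0]
print(phi)
print(meta)
```

In this toy game, feature 1 never interacts with feature 0, so its meta-attribution toward feature 0 is zero, while feature 2 receives a nonzero share through the interaction term. Under this assumed construction, the meta-attributions toward feature 0 sum to its first-order attribution by the efficiency property of the Shapley value, which mirrors the hierarchical decomposition stated in the abstract. The enumeration is exponential in the number of features, so this is only practical for very small games.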

large language model, machine learning, natural language
