Identifying interactions at scale for LLMs

AIHub

Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process more transparent to model builders and impacted humans, a step toward safer and more trustworthy AI. To achieve state-of-the-art performance, models synthesize complex feature relationships, find shared patterns across diverse training examples, and process information through highly interconnected internal components. In this blog post, we describe the fundamental ideas behind SPEX and ProxySPEX, algorithms capable of identifying these critical interactions at scale. The core measurement is simple: we mask or remove specific segments of the input prompt and measure the resulting shift in the model's predictions.
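To make that measurement concrete, here is a minimal sketch of the masking experiment, brute-forcing pairwise interactions. The `model` callable is a hypothetical stand-in that maps a prompt string to a scalar score (e.g., the log-probability of a target answer); SPEX and ProxySPEX exist precisely because this exhaustive loop does not scale.

from itertools import combinations

MASK = "[MASK]"

def masked_prompt(segments, masked_idx):
    # Replace the selected segments with a mask token.
    return " ".join(MASK if i in masked_idx else s for i, s in enumerate(segments))

def pairwise_interactions(model, segments):
    # For each pair (i, j), compare the effect of masking both segments
    # with the sum of the effects of masking each alone; the residual is
    # the non-additive (interaction) part.
    base = model(masked_prompt(segments, set()))
    solo = {i: model(masked_prompt(segments, {i})) - base for i in range(len(segments))}
    return {
        (i, j): model(masked_prompt(segments, {i, j})) - base - solo[i] - solo[j]
        for i, j in combinations(range(len(segments)), 2)
    }

A prompt with $n$ segments already requires $O(n^2)$ masked forward passes under this scheme, and higher-order interactions grow combinatorially, which is the scaling problem the algorithms above are designed to address.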


Path-Sampled Integrated Gradients

Kamalov, Firuz, Thabtah, Fadi, Sivaraj, R., Abdelhamid, Neda

arXiv.org Machine Learning

We introduce path-sampled integrated gradients (PS-IG), a framework that generalizes feature attribution by computing the expected value over baselines sampled along the linear interpolation path. We prove that PS-IG is mathematically equivalent to path-weighted integrated gradients, provided the weighting function matches the cumulative distribution function of the sampling density. This equivalence allows the stochastic expectation to be evaluated via a deterministic Riemann sum, improving the error convergence rate from $O(m^{-1/2})$ to $O(m^{-1})$ for smooth models. Furthermore, we demonstrate analytically that PS-IG functions as a variance-reducing filter against gradient noise, strictly lowering attribution variance by a factor of $1/3$ under uniform sampling, while preserving key axiomatic properties such as linearity and implementation invariance.
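As an illustration of the deterministic evaluation this equivalence enables, the sketch below computes integrated gradients with an $m$-step midpoint Riemann sum in NumPy. The quadratic test function and the finite-difference gradient are illustrative assumptions for self-containedness, not the paper's implementation.

import numpy as np

def numerical_grad(f, x, eps=1e-5):
    # Central-difference gradient of a scalar function f at x.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def integrated_gradients(f, x, baseline, m=64):
    # m-step midpoint Riemann sum along the straight path from baseline to x.
    total = np.zeros_like(x)
    for k in range(m):
        alpha = (k + 0.5) / m
        total += numerical_grad(f, baseline + alpha * (x - baseline))
    return (x - baseline) * total / m

# Completeness check: attributions should sum to f(x) - f(baseline).
f = lambda z: (z ** 2).sum()
x, b = np.array([1.0, 2.0]), np.zeros(2)
print(integrated_gradients(f, x, b).sum(), f(x) - f(b))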


Regional Explanations: Bridging Local and Global Variable Importance

Amoukou, Salim I., Brunel, Nicolas J-B.

arXiv.org Machine Learning

We analyze two widely used local attribution methods, Local Shapley Values and LIME, which aim to quantify the contribution of a feature value $x_i$ to a specific prediction $f(x_1, \dots, x_p)$. Despite their widespread use, we identify fundamental limitations in their ability to reliably detect locally important features, even under ideal conditions with exact computations and independent features. We argue that a sound local attribution method should not assign importance to features that neither influence the model output (e.g., features with zero coefficients in a linear model) nor exhibit statistical dependence with functionally relevant features. We demonstrate that both Local SV and LIME violate this fundamental principle. To address this, we propose R-LOCO (Regional Leave Out COvariates), which bridges the gap between local and global explanations and provides more accurate attributions. R-LOCO segments the input space into regions with similar feature importance characteristics. It then applies global attribution methods within these regions, deriving an instance's feature contributions from its regional membership. This approach delivers more faithful local attributions while avoiding local explanation instability and preserving instance-specific detail often lost in global methods.
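A minimal sketch of the regional idea, assuming a fitted scikit-learn regressor `model` and data `(X, y)`. Regions are found with k-means, and LOCO is approximated here by permuting a feature within a region rather than refitting the model without it, a simplification of the paper's procedure.

import numpy as np
from sklearn.cluster import KMeans

def regional_importance(model, X, y, n_regions=5, seed=0):
    rng = np.random.default_rng(seed)
    regions = KMeans(n_clusters=n_regions, n_init=10, random_state=seed).fit_predict(X)
    importance = np.zeros((n_regions, X.shape[1]))
    for r in range(n_regions):
        Xr, yr = X[regions == r], y[regions == r]
        base_err = np.mean((model.predict(Xr) - yr) ** 2)
        for j in range(X.shape[1]):
            Xp = Xr.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break feature j within the region
            importance[r, j] = np.mean((model.predict(Xp) - yr) ** 2) - base_err
    return regions, importance

# An instance's attribution is then the importance row of the region it falls in.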


A Bayesian Information-Theoretic Approach to Data Attribution

Tailor, Dharmesh, Felicioni, Nicolò, Ciosek, Kamil

arXiv.org Machine Learning

Training Data Attribution (TDA) seeks to trace model predictions back to influential training examples, enhancing interpretability and safety. We formulate TDA as a Bayesian information-theoretic problem: subsets are scored by the information loss they induce, i.e., the entropy increase at a query prediction when the subset is removed. This criterion credits examples for resolving predictive uncertainty rather than label noise. To scale to modern networks, we approximate information loss using a Gaussian Process surrogate built from tangent features. We show this aligns with classical influence scores for single-example attribution while promoting diversity for subsets. For even larger-scale retrieval, we relax to an information-gain objective and add a variance correction for scalable attribution in vector databases. Experiments show competitive performance on counterfactual sensitivity, ground-truth retrieval and coreset selection, showing that our method scales to modern architectures while bridging principled information-theoretic measures with practical attribution.
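A toy illustration of the entropy-increase score under a Gaussian Process surrogate; an RBF kernel on raw inputs stands in for the paper's tangent-feature construction, and only single-example removal is shown.

import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def predictive_var(X_train, x_query, noise=1e-2):
    # GP posterior variance at the query given the training inputs.
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    k = rbf(X_train, x_query[None, :])
    return (rbf(x_query[None, :], x_query[None, :]) - k.T @ np.linalg.solve(K, k)).item()

def information_loss(X_train, x_query, i):
    # Entropy increase at the query when training example i is removed.
    # Gaussian entropy is 0.5 * log(2*pi*e*var); the constants cancel here.
    v_full = predictive_var(X_train, x_query)
    v_minus = predictive_var(np.delete(X_train, i, axis=0), x_query)
    return 0.5 * (np.log(v_minus) - np.log(v_full))

Examples whose removal most inflates the predictive entropy at the query receive the highest score, matching the criterion of crediting points that resolve predictive uncertainty.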


Post-hoc Self-explanation of CNNs

Boubekki, Ahcène, Clemmensen, Line H.

arXiv.org Machine Learning

Although standard Convolutional Neural Networks (CNNs) can be mathematically reinterpreted as Self-Explainable Models (SEMs), their built-in prototypes do not on their own accurately represent the data. Replacing the final linear layer with a $k$-means-based classifier addresses this limitation without compromising performance. This work introduces a common formalization of $k$-means-based post-hoc explanations for the classifier, the encoder's final output (B4), and combinations of intermediate feature activations. The latter approach leverages the spatial consistency of convolutional receptive fields to generate concept-based explanation maps, which are supported by gradient-free feature attribution maps. Empirical evaluation with a ResNet34 shows that using shallower, less compressed feature activations, such as those from the last three blocks (B234), improves semantic fidelity at the cost of a slight reduction in predictive performance.
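A minimal sketch of the $k$-means-based classifier in PyTorch, assuming a torchvision ResNet34 encoder and final-block (B4) features only; the paper's formalization, including the B234 variants over intermediate activations, is richer than this. `proto_labels` (e.g., the majority class per cluster) is an illustrative choice.

import torch
import torchvision.models as models
from sklearn.cluster import KMeans

encoder = models.resnet34(weights="DEFAULT")
encoder.fc = torch.nn.Identity()  # keep pooled B4 features, drop the linear head
encoder.eval()

@torch.no_grad()
def features(x):
    return encoder(x)  # (batch, 512)

def fit_prototypes(train_feats, k=50):
    # Cluster encoder outputs; the centroids act as prototypes.
    return KMeans(n_clusters=k, n_init=10).fit(train_feats.numpy())

def predict(kmeans, proto_labels, x):
    # Classify by the nearest prototype; the assignment itself is the
    # explanation ("this input looks like prototype c").
    c = kmeans.predict(features(x).numpy())
    return proto_labels[c]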



GAIA: Delving into Gradient-based Attribution Abnormality for Out-of-distribution Detection (Supplementary Material: A. Extensive Experiments, A.1 Computational Efficiency of GAIA Methods)

Neural Information Processing Systems

In Tab. 1, we conduct the test on a Tesla V100 to measure the computational efficiency of the GAIA methods. In Tab. 2, we train five ResNet34 models for the CIFAR benchmarks (CIFAR10 and CIFAR100); the blocks, labeled block1 to block5, correspond to the output features obtained from shallow to deep. In Section 4.1, we introduce channel-wise average abnormality under the assumption that Gradient-based Class Activation Mapping (GradCAM) can be regarded as having only first-order, independent terms; here we provide a proof (from [18]) for this assumption. The issue of attribution can be viewed as the assignment of credit in cooperative game theory. Null Player Axiom: if the removal of a feature across all potential coalitions with other features has no impact on the output, it should be assigned zero importance. In Section 4.2, we introduce the two-stage fusion strategy for GAIA-A. The extensive results are shown in Tab. 3, indicating the effectiveness of our fusion strategy.
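The Null Player Axiom quoted above can be checked by brute force for a toy set function; `value` plays the role of the model output over feature coalitions, an illustrative assumption rather than anything from GAIA's code.

from itertools import chain, combinations

def powerset(items):
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def is_null_player(value, players, i):
    # True if adding player i never changes the value of any coalition.
    others = [p for p in players if p != i]
    return all(value(set(S) | {i}) == value(set(S)) for S in powerset(others))

# Feature 2 never affects the output, so it must receive zero importance.
v = lambda S: len(S & {0, 1})
print(is_null_player(v, [0, 1, 2], 2))  # True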