Mechanistic Interpretability



Scale Alone Does not Improve Mechanistic Interpretability in Vision Models

Neural Information Processing Systems

In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural networks to unprecedented levels in dataset and model size. Here, we ask whether this extraordinary increase in scale also positively impacts the field of mechanistic interpretability. In other words, has our understanding of the inner workings of scaled neural networks improved as well? We use a psychophysical paradigm to quantify one form of mechanistic interpretability for a diverse suite of nine models and find no scaling effect for interpretability, neither for model nor for dataset size. Specifically, none of the investigated state-of-the-art models are easier to interpret than the GoogLeNet model from almost a decade ago.


Compact Proofs of Model Performance via Mechanistic Interpretability

Neural Information Processing Systems

We propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving accuracy lower bounds for a small transformer trained on Max-of-$K$, validating proof transferability across 151 random seeds and four values of $K$. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless errors as a key challenge for using mechanistic interpretability to generate compact proofs of model performance.


Towards Automated Circuit Discovery for Mechanistic Interpretability

Neural Information Processing Systems

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process's steps: finding the connections between the abstract neural network units that form a circuit. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work.
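The activation patching step this abstract describes can be illustrated on a toy network: run a clean and a corrupted input, splice one clean hidden activation into the corrupted run, and measure how much of the clean output it restores. The sketch below is a minimal numpy illustration with purely hypothetical weights and inputs, not any of the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network with hypothetical random weights (illustration only).
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4,))

def forward(x, patch=None):
    """Run the toy network; optionally overwrite one hidden activation."""
    h = np.tanh(x @ W1)              # hidden units -- the components we patch
    if patch is not None:
        i, value = patch
        h = h.copy()
        h[i] = value
    return float(h @ W2)

x_clean = np.ones(4)                 # input that elicits the behavior
x_corrupt = -np.ones(4)              # counterfactual input

h_clean = np.tanh(x_clean @ W1)
y_clean, y_corrupt = forward(x_clean), forward(x_corrupt)

# Patch each hidden unit's clean activation into the corrupted run; units
# whose patch moves the output back toward y_clean belong to the circuit.
effects = [forward(x_corrupt, patch=(i, h_clean[i])) - y_corrupt
           for i in range(4)]
```

Automating which patches to test, and extending the unit of patching from single neurons to the edges between components, is the part that circuit-discovery algorithms like ACDC take over.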


Mechanistic Interpretability of Antibody Language Models Using SAEs

Haque, Rebonto, Turnbull, Oliver M., Parsan, Anisha, Parsan, Nithin, Yang, John J., Deane, Charlotte M.

arXiv.org Artificial Intelligence

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate an autoregressive antibody language model, p-IgGen, and steer its generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs are sufficient for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
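The TopK mechanism mentioned above is simple to sketch: encode an activation vector into an overcomplete latent space, then zero every latent except the k largest. A minimal numpy illustration with made-up dimensions (not those of p-IgGen or the paper's SAEs):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent, k = 8, 32, 4       # hypothetical sizes for illustration

W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))

def encode(x):
    """ReLU encoder followed by a hard TopK: only the k largest latents survive."""
    z = np.maximum(x @ W_enc, 0.0)
    sparse = np.zeros_like(z)
    top = np.argsort(z)[-k:]          # indices of the k largest activations
    sparse[top] = z[top]
    return sparse

def decode(z):
    return z @ W_dec

x = rng.normal(size=d_model)          # stand-in for a model activation vector
z = encode(x)
x_hat = decode(z)
n_active = int((z > 0).sum())         # at most k latents fire per input
```

Because at most k latents are nonzero for any input, each latent can be inspected, and steered, individually; the paper's finding is that this inspectability alone does not guarantee causal control over generation.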


Sparse Attention Post-Training for Mechanistic Interpretability

Draye, Florent, Lei, Anson, Posner, Ingmar, Schölkopf, Bernhard

arXiv.org Artificial Intelligence

We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.3 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.
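To see concretely what attention connectivity at about 0.3% of its edges means, the sketch below prunes a dense attention map post hoc to that edge fraction by magnitude and renormalises each row. This is only an illustration of the target sparsity level; the paper reaches it by training with a sparsity regulariser under a constrained-loss objective, not by pruning:

```python
import numpy as np

def sparsify_attention(attn, keep_frac=0.003):
    """Keep only the largest keep_frac of attention edges; renormalise rows.

    Post-hoc magnitude pruning, used here purely to visualise the target
    edge fraction -- not the paper's training-based method."""
    flat = attn.ravel()
    k = max(1, int(round(keep_frac * flat.size)))
    thresh = np.sort(flat)[-k]                    # k-th largest edge weight
    pruned = np.where(attn >= thresh, attn, 0.0)
    row_sums = pruned.sum(axis=-1, keepdims=True)
    return pruned / np.maximum(row_sums, 1e-12)   # empty rows stay all-zero

# Example: a random 64x64 attention map reduced to roughly 0.3% of its edges.
rng = np.random.default_rng(3)
logits = rng.normal(size=(64, 64))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
sparse = sparsify_attention(attn)
```

With only a dozen surviving edges out of 4,096, the remaining connectivity pattern is small enough to read off directly, which is the interpretability payoff the abstract describes.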


Towards Ethical Multi-Agent Systems of Large Language Models: A Mechanistic Interpretability Perspective

Lee, Jae Hee, Lauscher, Anne, Albrecht, Stefano V.

arXiv.org Artificial Intelligence

Large language models (LLMs) have been widely deployed in various applications, often functioning as autonomous agents that interact with each other in multi-agent systems. While these systems have shown promise in enhancing capabilities and enabling complex tasks, they also pose significant ethical challenges. This position paper outlines a research agenda aimed at ensuring the ethical behavior of multi-agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks to assess ethical behavior at individual, interactional, and systemic levels; (ii) elucidating the internal mechanisms that give rise to emergent behaviors through mechanistic interpretability; and (iii) implementing targeted parameter-efficient alignment techniques to steer MALMs towards ethical behaviors without compromising their performance.


Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations

Mitra, Chancharik, Luo, Yusen, Saravanan, Raj, Niu, Dantong, Pai, Anirudh, Thomason, Jesse, Darrell, Trevor, Anwar, Abrar, Ramanan, Deva, Herzig, Roei

arXiv.org Artificial Intelligence

Vision-Language-Action (VLA) models promise to extend the remarkable success of vision-language models (VLMs) to robotics. Yet, unlike VLMs in the vision-language domain, VLAs for robotics require finetuning to contend with varying physical factors such as robot embodiment, environment characteristics, and the spatial relationships of each task. Existing finetuning methods lack specificity, adapting the same set of parameters regardless of a task's visual, linguistic, and physical characteristics. Inspired by functional specificity in neuroscience, we hypothesize that it is more effective to finetune sparse model representations specific to a given task. In this work, we introduce Robotic Steering, a finetuning approach grounded in mechanistic interpretability that leverages few-shot demonstrations to identify and selectively finetune task-specific attention heads aligned with the physical, visual, and linguistic requirements of robotic tasks. Through comprehensive on-robot evaluations with a Franka Emika robot arm, we demonstrate that Robotic Steering outperforms LoRA while achieving superior robustness under task variation, reduced computational cost, and enhanced interpretability when adapting VLAs to diverse robotic tasks.
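The head-selection idea can be sketched as scoring each attention head by how much its activity shifts on the task demonstrations relative to a generic baseline, then unfreezing only the top-scoring heads for finetuning. The shapes, numbers, and scoring rule below are hypothetical stand-ins for illustration, not the paper's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
n_layers, n_heads, k = 4, 8, 5        # hypothetical model shape and budget

# Mean per-head activation norms, as if recorded while replaying a few task
# demonstrations and a generic baseline set of prompts (synthetic numbers).
task_acts = rng.normal(loc=1.0, size=(n_layers, n_heads))
base_acts = rng.normal(loc=0.0, size=(n_layers, n_heads))

# Score each head by how much its activity shifts on the demonstrations;
# only the k highest-scoring heads would then be unfrozen for finetuning.
scores = np.abs(task_acts - base_acts)
top = np.argsort(scores.ravel())[-k:]
selected = [divmod(int(i), n_heads) for i in top]   # (layer, head) pairs
```

Restricting the trainable parameters to a handful of task-relevant heads is what gives the approach its lower compute cost relative to adapting a fixed parameter set such as LoRA matrices.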


Mechanistic Interpretability for Transformer-based Time Series Classification

Kalnāre, Matīss, Kitharidis, Sofoklis, Bäck, Thomas, van Stein, Niki

arXiv.org Artificial Intelligence

Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing explainability methods often focus on input-output attributions, leaving the internal mechanisms largely opaque. This paper addresses this gap by adapting several mechanistic interpretability techniques (activation patching, attention saliency, and sparse autoencoders) from NLP to transformer architectures designed explicitly for time series classification. We systematically probe the internal causal roles of individual attention heads and timesteps, revealing causal structures within these models. Through experimentation on a benchmark time series dataset, we construct causal graphs illustrating how information propagates internally, highlighting key attention heads and temporal positions driving correct classifications. Additionally, we demonstrate the potential of sparse autoencoders for uncovering interpretable latent features. Our findings provide both methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance in time series classification tasks.


Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks

Kowalska, Bianka, Kwaśnicka, Halina

arXiv.org Artificial Intelligence

Artificial intelligence (AI) is increasingly assisting us in a wide range of tasks, from everyday applications like recommendation systems to high-risk domains such as biometric recognition, autonomous vehicles, and medical diagnosis [1]. In particular, the rise of transformer-based models, such as those used in natural language processing (NLP), has significantly accelerated AI's adoption and visibility in society, enabling breakthroughs in fields like text generation, translation, and image understanding [2]. The size, complexity, and opacity of deep learning models are growing exponentially, further outpacing the ability of researchers to understand the black box. As deep neural networks are increasingly deployed in real-world applications with more advanced use cases, the impact of AI continues to grow. This growing influence, coupled with the often opaque, black-box nature of most AI systems, has led to a heightened demand for AI models that are both faithful and explainable. The validation of AI's decisions is especially critical in high-risk areas such as law or medicine [3, 4]. As a result, Explainable AI (XAI) emerged as a direct response to companies' and researchers' demands to interpret, explain, and validate neural networks in order to make AI systems trustworthy. XAI encompasses all methods, approaches, and efforts to uncover the reasoning and behavior of artificial intelligence systems [1]. Thus, it is important to establish an understanding of common terms used in the XAI literature, despite the lack of universally accepted definitions.