AtP*: An efficient and scalable method for localizing LLM behaviour to components

Kramár, János, Lieberum, Tom, Shah, Rohin, Nanda, Neel

Mar-1-2024–arXiv.org Artificial Intelligence

As LLMs become ubiquitous and integrated into numerous digital applications, understanding the internal mechanisms that underlie their behaviour has become an increasingly pressing research problem; this is the problem of mechanistic interpretability. A fundamental subproblem is to causally attribute particular behaviours to individual parts of the transformer forward pass, corresponding to specific components (such as attention heads, neurons, layer contributions, or residual streams), often at specific positions in the input token sequence. This is important because numerous case studies have found complex behaviours to be driven by sparse subgraphs within the model (Meng et al., 2023; Olsson et al., 2022; Wang et al., 2022). A classic form of causal attribution uses zero-ablation, or knock-out, where a component is deleted and we observe whether this negatively affects the model's output; a negative effect implies the component was causally important. More recent work has generalised this to replacing a component's activations with samples from some baseline distribution (with zero-ablation being the special case where activations are resampled to be zero). We focus on the widely used method of Activation Patching (also known as causal mediation analysis) (Chan et al., 2022; Geiger et al., 2022; Meng et al., 2023), where the baseline distribution is a component's activations on some corrupted input, such as an alternate string with a different answer (Pearl, 2001; Robins and Greenland, 1992). Given a causal attribution method, it is common to sweep across all model components, directly evaluating the effect of intervening on each of them via resampling (Meng et al., 2023). However, when working with SoTA models it can be expensive to attribute behaviour, especially to small components (e.g.
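The definitions above (zero-ablation, activation patching from a corrupted input, and a sweep over all components) can be illustrated on a toy model. The following is a minimal sketch, not the paper's method or code: the two-layer NumPy network, the choice of hidden neurons as "components", and the names `hidden`, `metric_from_hidden`, and `patch_effect` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy model: 4 inputs -> 8 tanh hidden neurons -> scalar output.
# The hidden neurons play the role of "components"; the scalar output is the metric.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))
w2 = rng.normal(size=8)

def hidden(x):
    """Hidden activations for input x."""
    return np.tanh(x @ W1)

def metric_from_hidden(h):
    """Scalar behaviour metric computed from hidden activations."""
    return float(h @ w2)

def patch_effect(x_clean, baseline_h, i):
    """Change in the metric when neuron i's clean activation is replaced
    by baseline_h[i], with all other neurons left at their clean values."""
    h = hidden(x_clean).copy()
    clean_metric = metric_from_hidden(h)
    h[i] = baseline_h[i]
    return metric_from_hidden(h) - clean_metric

x_clean = rng.normal(size=4)    # stands in for the original prompt
x_corrupt = rng.normal(size=4)  # stands in for the corrupted prompt
h_clean = hidden(x_clean)
h_corrupt = hidden(x_corrupt)
n = h_clean.shape[0]

# Activation patching: the baseline is the activation on the corrupted input.
patching_effects = np.array([patch_effect(x_clean, h_corrupt, i) for i in range(n)])

# Zero-ablation: the special case where the baseline activation is zero.
zero_effects = np.array([patch_effect(x_clean, np.zeros(n), i) for i in range(n)])

# Rank components by the magnitude of their patching effect.
ranking = np.argsort(-np.abs(patching_effects))
```

Each entry of the sweep re-evaluates the metric once per component, which is why the cost grows with the number of components. In this toy the readout is linear in the hidden layer, so each effect can also be written in closed form; that shortcut is a property of the toy only and does not hold for a real transformer, where every intervention requires its own forward pass.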

large language model, machine learning, node

Country:
- North America
  - Greenland (0.24)
  - United States > Hawaii (0.14)

Genre:
- Research Report > New Finding (0.93)

Industry:
- Law (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language > Large Language Model (1.00)
