Structural Inference: Interpreting Small Language Models with Susceptibilities
Baker, Garrett; Wang, George; Hoogland, Jesse; Murfet, Daniel
arXiv.org Artificial Intelligence
We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.
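The linear response idea in the abstract can be illustrated with a small numerical sketch. It assumes, as is standard for Gibbs-type distributions but not stated in the abstract, that the posterior has the tempered form p_h(w) ∝ exp(-nβ (L(w) + h ΔL(w))), so that the first-order change of an observable's expectation is -nβ Cov(observable, ΔL). It also assumes the perturbation loss is a sum over tokens, which is what makes the per-token split exact here. The function names, the symbol `n_beta`, and the synthetic draws standing in for SGLD samples are all hypothetical; this is not the authors' implementation.

```python
import numpy as np

def susceptibility(observable, delta_loss, n_beta):
    """Covariance estimator chi = -n_beta * Cov(observable, delta_loss).

    Both inputs hold one value per posterior sample (assumed estimator form).
    """
    obs = np.asarray(observable, dtype=float)
    dl = np.asarray(delta_loss, dtype=float)
    return -n_beta * np.mean((obs - obs.mean()) * (dl - dl.mean()))

def per_token_susceptibilities(observable, token_delta_losses, n_beta):
    """One signed score per token; token_delta_losses is (num_samples, num_tokens)."""
    obs = np.asarray(observable, dtype=float)
    tok = np.asarray(token_delta_losses, dtype=float)
    centered = (obs - obs.mean())[:, None] * (tok - tok.mean(axis=0))
    return -n_beta * centered.mean(axis=0)

def response_matrix(observables, delta_losses, n_beta):
    """Rows index network components, columns index data perturbations."""
    obs = np.asarray(observables, dtype=float)   # (num_samples, num_components)
    dl = np.asarray(delta_losses, dtype=float)   # (num_samples, num_perturbations)
    obs_c = obs - obs.mean(axis=0)
    dl_c = dl - dl.mean(axis=0)
    return -n_beta * (obs_c.T @ dl_c) / obs.shape[0]

rng = np.random.default_rng(0)
n_beta = 50.0

# Solvable check: L(w) = w^2 / 2 and Delta L(w) = w give E_h[w] = -h,
# so the exact susceptibility of the observable w is -1.
# Exact Gaussian draws stand in for SGLD samples here.
w = rng.normal(0.0, np.sqrt(1.0 / n_beta), size=20_000)
chi_gauss = susceptibility(w, w, n_beta)
print("Gaussian check (exact value is -1):", chi_gauss)

# Per-token scores sum to the susceptibility of the summed perturbation,
# because covariance is linear in its second argument.
obs = rng.normal(size=1000)
tok = rng.normal(size=(1000, 5)) + 0.3 * obs[:, None]
scores = per_token_susceptibilities(obs, tok, n_beta)
total = susceptibility(obs, tok.sum(axis=1), n_beta)
print("per-token scores:", scores)
print("sum of scores vs. total:", scores.sum(), total)

# Toy response matrix: six components and four perturbations driven by
# two latent factors, so the matrix has rank two by construction.
z = rng.normal(size=(2000, 2))
A = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
B = np.array([[1, 0.5, 0, 0], [0, 0, 1, 0.5]], dtype=float)
chi = response_matrix(z @ A, z @ B, n_beta)
singular_values = np.linalg.svd(chi, compute_uv=False)
print("singular values:", singular_values)
```

In this toy setup the two dominant singular values correspond to the two groups of components that share a latent factor, which is the sense in which low-rank structure in a response matrix can separate modules. Real SGLD samples would add sampling noise and autocorrelation that the sketch ignores.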
May-22-2025