Structural Inference: Interpreting Small Language Models with Susceptibilities

Baker, Garrett, Wang, George, Hoogland, Jesse, Murfet, Daniel

May-22-2025–arXiv.org Artificial Intelligence

We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.

machine learning, natural language, susceptibility, (18 more...)

arXiv.org Artificial Intelligence

May-22-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.67)

Genre:
- Research Report (0.82)

Industry:
- Law (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language (1.00)
  - Machine Learning
    - Statistical Learning (1.00)
    - Neural Networks (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found