Uncovering Intermediate Variables in Transformers using Circuit Probing
Lepori, Michael A., Serre, Thomas, Pavlick, Ellie
–arXiv.org Artificial Intelligence
Probing involves learning a classifier to decode information about a hypothesized intermediate variable from model activations (Tenney et al., 2019; Hewitt & Manning, 2019; Ettinger, 2020; Li et al., 2022; Nanda et al., 2023). Prior work has demonstrated that probing often mischaracterizes the underlying computations performed by the model (Hewitt & Liang, 2019; Zhang & Bowman, 2018; Voita & Titov, 2020). Among other methods (Ravfogel et al., 2020; 2022; Belrose et al., 2023), counterfactual embeddings were introduced to address this problem, enabling causal analysis using probes (Tucker et al., 2021). We empirically demonstrate that circuit probing is more faithful to a model's computation than linear or nonlinear probing (see Sections 3.2, 3.3), and that counterfactual embeddings do not always solve this problem (see Sections 3.1, 3.2). Causal abstraction analysis intervenes on the activation vectors produced by Transformer layers to localize intermediate variables to particular vector subspaces (Geiger et al., 2021; 2023; Wu et al., 2023). However, causal abstraction analysis requires hypothesizing a complete causal graph - a high-level description of how inputs are mapped to predictions, including all interactions between intermediate variables. This is impossible for most real-world tasks to which we want to apply neural networks (such as language modeling). We empirically demonstrate the utility of recent causal abstraction analysis techniques when full causal graphs are available, and show that circuit probing arrives at the same results (see Sections 3.1, 3.2).
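To make the baseline concrete, below is a minimal sketch of the standard probing setup described above: a linear classifier trained on frozen Transformer activations to decode a hypothesized intermediate variable. This is an illustration, not the paper's exact procedure; the layer choice, probe architecture, and the `train_linear_probe` helper are assumptions for the example.

```python
# Minimal sketch of a linear probe over frozen Transformer activations.
# Hypothetical helper for illustration; not the authors' implementation.
import torch
import torch.nn as nn

def train_linear_probe(activations, labels, num_classes, epochs=100, lr=1e-3):
    """Fit a linear classifier that decodes a hypothesized intermediate
    variable from detached hidden states of shape [N, d_model]."""
    probe = nn.Linear(activations.shape[-1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = probe(activations)     # model weights stay frozen;
        loss = loss_fn(logits, labels)  # only the probe is trained
        loss.backward()
        optimizer.step()
    return probe

# Usage (assumed setup): `activations` are hidden states collected from one
# layer of the model, `labels` are values of the hypothesized variable.
# probe = train_linear_probe(activations.detach(), labels, num_classes=2)
```

High probe accuracy in this setup shows only that the variable is decodable from the activations, not that the model actually uses it, which is the faithfulness gap that motivates counterfactual embeddings, causal abstraction analysis, and circuit probing.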
Nov-17-2023