Uncovering Intermediate Variables in Transformers using Circuit Probing
Lepori, Michael A., Serre, Thomas, Pavlick, Ellie
–arXiv.org Artificial Intelligence
Probing involves learning a classifier to decode information about a hypothesized intermediate variable from model activations (Tenney et al., 2019; Hewitt & Manning, 2019; Ettinger, 2020; Li et al., 2022; Nanda et al., 2023). Prior work has demonstrated that probing often mischaracterizes the underlying computations performed by the model (Hewitt & Liang, 2019; Zhang & Bowman, 2018; Voita & Titov, 2020). Among other methods (Ravfogel et al., 2020; 2022; Belrose et al., 2023), counterfactual embeddings were introduced to address this problem, enabling causal analysis using probes (Tucker et al., 2021). We empirically demonstrate that circuit probing is more faithful to a model's computation than linear or nonlinear probing (see Sections 3.2, 3.3), and that counterfactual embeddings do not always solve this problem (see Sections 3.1, 3.2). Causal abstraction analysis intervenes on the activation vectors produced by Transformer layers to localize intermediate variables to particular vector subspaces (Geiger et al., 2021; 2023; Wu et al., 2023). However, causal abstraction analysis requires hypothesizing a complete causal graph - a high-level description of how inputs are mapped to predictions, including all interactions between intermediate variables. This is impossible for most real-world tasks to which we want to apply neural networks (such as language modeling). We empirically demonstrate the utility of recent causal abstraction analysis techniques when full causal graphs are available, and show that circuit probing arrives at the same results (see Sections 3.1, 3.2).
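To make the baseline concrete, below is a minimal sketch of the standard probing setup described above: a linear classifier trained on frozen Transformer activations to decode a hypothesized intermediate variable. This is an illustration, not the paper's exact procedure; the layer choice, probe architecture, and the `train_linear_probe` helper are assumptions for the example.

```python
# Minimal sketch of a linear probe over frozen Transformer activations.
# Hypothetical helper for illustration; not the authors' implementation.
import torch
import torch.nn as nn

def train_linear_probe(activations, labels, num_classes, epochs=100, lr=1e-3):
    """Fit a linear classifier that decodes a hypothesized intermediate
    variable from detached hidden states of shape [N, d_model]."""
    probe = nn.Linear(activations.shape[-1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = probe(activations)     # model weights stay frozen;
        loss = loss_fn(logits, labels)  # only the probe is trained
        loss.backward()
        optimizer.step()
    return probe

# Usage (assumed setup): `activations` are hidden states collected from one
# layer of the model, `labels` are values of the hypothesized variable.
# probe = train_linear_probe(activations.detach(), labels, num_classes=2)
```

High probe accuracy in this setup shows only that the variable is decodable from the activations, not that the model actually uses it, which is the faithfulness gap that motivates counterfactual embeddings, causal abstraction analysis, and circuit probing.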
Nov-17-2023