Discovering Variable Binding Circuitry with Desiderata

Davies, Xander, Nadeau, Max, Prakash, Nikhil, Shaham, Tamar Rott, Bau, David

Jul-7-2023–arXiv.org Artificial Intelligence

Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach that extends causal mediation experiments to automatically identify the model components responsible for performing a specific subtask solely by specifying a set of desiderata, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.
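To make the idea of desiderata concrete, here is a minimal toy sketch of causal mediation by activation patching. It is not the paper's procedure and does not touch LLaMA-13B: the three-component "model", the component names, and the two desiderata checks are all hypothetical, and the search is an exhaustive test of single components on one example, standing in for whatever automatic search the authors use. A component passes if patching in its activation from a run with a different variable value moves the output to that run's output, while patching from a run that differs only in the operand leaves the output unchanged.

```python
from typing import Callable, Dict, List, Optional

Inputs = Dict[str, float]

# Hypothetical toy "model": every component writes one number into a shared sum,
# loosely mimicking components that write into a residual stream.
COMPONENTS: Dict[str, Callable[[Inputs], float]] = {
    "head_value": lambda inp: inp["x"],    # retrieves the variable's value
    "head_operand": lambda inp: inp["c"],  # reads the literal operand
    "mlp_bias": lambda inp: 0.5,           # independent of the input
}


def run(inp: Inputs, patch: Optional[Dict[str, float]] = None) -> float:
    """Run the toy model, overriding any component activation listed in `patch`."""
    patch = patch or {}
    acts = {name: patch.get(name, fn(inp)) for name, fn in COMPONENTS.items()}
    return sum(acts.values())


def satisfies_desiderata(name: str, base: Inputs,
                         value_cf: Inputs, operand_cf: Inputs) -> bool:
    """Check two illustrative desiderata for a single component."""
    fn = COMPONENTS[name]
    # Desideratum 1: patching from the run with a different variable value
    # should change the output to that run's output.
    moves = run(base, {name: fn(value_cf)}) == run(value_cf)
    # Desideratum 2: patching from the run with a different operand only
    # should leave the base output unchanged.
    stays = run(base, {name: fn(operand_cf)}) == run(base)
    return moves and stays


base = {"x": 3.0, "c": 4.0}        # e.g. "x = 3; x + 4"
value_cf = {"x": 7.0, "c": 4.0}    # variable value changed
operand_cf = {"x": 3.0, "c": 9.0}  # operand changed, variable untouched

found: List[str] = [
    name for name in COMPONENTS
    if satisfies_desiderata(name, base, value_cf, operand_cf)
]
print(found)
```

In this toy, only the component that carries the variable's value passes both checks. A real model has interacting components and many more candidates, so testing components one at a time as done here would not be expected to scale or to capture circuits that span several components.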

artificial intelligence, machine learning, natural language

Country:
- North America > United States (0.28)

Genre:
- Research Report (0.65)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (0.70)
  - Natural Language (1.00)
