Localizing Model Behavior with Path Patching

Goldowsky-Dill, Nicholas, MacLeod, Chris, Sato, Lucas, Arora, Aryaman

May-16-2023–arXiv.org Artificial Intelligence

Localizing behaviors of neural networks to a subset of the network's components or a subset of interactions between components is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad-hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses expressing that behaviors are localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open source a framework for efficiently running similar experiments.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

May-16-2023

arXiv.org PDF

Add feedback

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
  - Natural Language (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found