Towards Automated Circuit Discovery for Mechanistic Interpretability
Neural Information Processing Systems
Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component.
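The activation patching step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the three-unit toy model, the unit names "a", "b", and "c", the weights, and the use of absolute output change as the metric. It runs the model on a clean input, swaps in one unit's activation from a corrupted input at a time, and records how much the output moves.

```python
# Toy sketch of activation patching. The model, unit names, and weights are
# hypothetical and chosen only to make the arithmetic easy to follow.

WEIGHTS = {"a": 3.0, "b": 0.0, "c": 0.5}  # how much each unit feeds the output


def compute_activations(x):
    """Compute each unit's activation for a scalar input x."""
    return {"a": 2.0 * x, "b": x * x, "c": 5.0}  # "c" ignores the input


def run_model(x, patch=None):
    """Run the toy model, overwriting any activations listed in `patch`."""
    acts = compute_activations(x)
    acts.update(patch or {})
    return sum(WEIGHTS[name] * value for name, value in acts.items())


def patching_effects(clean_x, corrupted_x):
    """Patch each unit's corrupted activation into the clean run, one at a
    time, and record the absolute change in the output (the assumed metric)."""
    baseline = run_model(clean_x)
    corrupted_acts = compute_activations(corrupted_x)
    return {
        name: abs(run_model(clean_x, {name: value}) - baseline)
        for name, value in corrupted_acts.items()
    }


effects = patching_effects(clean_x=1.0, corrupted_x=4.0)
print(effects)
```

In this toy setting only unit "a" shows a nonzero effect. Unit "b" changes between the two inputs but has zero weight on the output, and unit "c" feeds the output but takes the same value on both inputs, so patching it changes nothing. This is why the choice of dataset matters alongside the choice of metric and units.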
May-28-2025, 20:54:34 GMT