Towards Automated Circuit Discovery for Mechanistic Interpretability
Neural Information Processing Systems
Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior.
Oct-8-2025, 10:30:19 GMT
- Country:
  - Europe
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
    - Italy (0.04)
    - Spain > Catalonia
      - Barcelona Province > Barcelona (0.04)
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.14)
  - North America
    - Canada > British Columbia
      - Vancouver (0.04)
    - Dominican Republic (0.04)
    - United States
      - California > Los Angeles County
        - Long Beach (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - Washington > King County
        - Seattle (0.14)
- Genre:
  - Research Report (0.68)
  - Workflow (0.94)
- Technology: