Towards Automated Circuit Discovery for Mechanistic Interpretability

Oct-11-2024, 02:59:06 GMT–Neural Information Processing Systems

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process' steps: finding the connections between the abstract neural network units that form a circuit.
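
The activation patching step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the two-layer ReLU network, the weights, and the names `forward` and `patching_effects` are hypothetical, chosen only to show the idea of overwriting one unit's activation with its value from a corrupted input and measuring how much a chosen metric changes. It patches individual hidden units, whereas a circuit-discovery method would also need to consider the connections between units.

```python
import numpy as np

def forward(x, W1, W2, patch=None):
    """Run a tiny two-layer ReLU network (hypothetical toy model).

    patch: optional dict {hidden_unit_index: value} that overwrites the
    activation of those hidden units before the output layer is applied.
    Returns (output, hidden_activations).
    """
    h = np.maximum(0.0, W1 @ x)
    if patch:
        h = h.copy()
        for i, v in patch.items():
            h[i] = v
    return W2 @ h, h

def patching_effects(x_clean, x_corrupt, W1, W2, metric):
    """For each hidden unit, replace its clean activation with the
    corrupted-run activation and record the resulting metric change."""
    base, _ = forward(x_clean, W1, W2)
    _, h_corrupt = forward(x_corrupt, W1, W2)
    effects = []
    for i in range(len(h_corrupt)):
        out, _ = forward(x_clean, W1, W2, patch={i: h_corrupt[i]})
        effects.append(metric(base) - metric(out))
    return np.array(effects)

# Toy setup (assumed values): three hidden units, one output.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
W2 = np.array([[2.0, -1.0, 0.0]])  # unit 2 has no path to the output
x_clean = np.array([1.0, 2.0])
x_corrupt = np.array([3.0, 2.0])   # differs from x_clean in the first feature only

effects = patching_effects(
    x_clean, x_corrupt, W1, W2, metric=lambda out: float(out[0])
)
print(effects)
```

In this toy case only unit 0 shows a nonzero effect: unit 1 has the same activation on both inputs, and unit 2 has zero output weight. Under the chosen metric and input pair, a unit with zero effect is a candidate for exclusion from the circuit.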

automated circuit discovery, mechanistic interpretability, neural network unit

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.75)