Towards Automated Circuit Discovery for Mechanistic Interpretability

Neural Information Processing Systems 

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior.
