Tracr: Compiled Transformers as a Laboratory for Interpretability

Dec-26-2025, 03:47:17 GMT–Neural Information Processing Systems

We show how to compile human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study superposition in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground truth for evaluating interpretability methods. Commonly, because the programs learned by transformers are unknown, it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking.
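
To make the token-frequency example concrete, here is a minimal pure-Python sketch of the kind of human-readable program the abstract refers to. It is an illustration under assumptions, not Tracr's API: the helper names `select`, `selector_width`, and `token_frequencies` are hypothetical stand-ins for attention-style primitives, and nothing here is compiled into transformer weights.

```python
from typing import Callable, List, Sequence


def select(
    keys: Sequence[str],
    queries: Sequence[str],
    predicate: Callable[[str, str], bool],
) -> List[List[bool]]:
    """Build a boolean attention pattern (hypothetical helper).

    Row i corresponds to query position i, column j to key position j.
    """
    return [[predicate(k, q) for k in keys] for q in queries]


def selector_width(selector: List[List[bool]]) -> List[int]:
    """Count, for each query position, how many key positions are selected."""
    return [sum(row) for row in selector]


def token_frequencies(tokens: Sequence[str]) -> List[int]:
    """For each position, count how often its token occurs in the sequence."""
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)


print(token_frequencies(list("hello")))  # [1, 1, 2, 2, 1]
```

Each position attends to every position holding the same token, and the count of attended positions is that token's frequency. A compiler in the spirit described above would map each such step to a known component of the model, which is what makes the resulting structure usable as a reference point.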

name change, tracr, transformer, (3 more...)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.40)