Tracr: Compiled Transformers as a Laboratory for Interpretability

Jan-19-2025, 09:00:13 GMT – Neural Information Processing Systems

We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure, which can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. The known structure of Tracr-compiled models can also serve as ground truth for evaluating interpretability methods.
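The abstract does not show what a "human-readable program" might look like. As a rough illustration only, the sketch below uses plain Python to mimic the select/aggregate style of RASP, the language Tracr takes as input. The helper names `select`, `aggregate`, and `frac_prevs` are hypothetical stand-ins written for this example. They are not Tracr's API, and nothing here compiles to a transformer.

```python
from typing import Callable, List, Sequence


def select(keys: Sequence, queries: Sequence,
           predicate: Callable[[object, object], bool]) -> List[List[bool]]:
    """Build a boolean attention pattern.

    Row i corresponds to query position i; entry j is True when the
    query at i should attend to the key at j.
    """
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(pattern: List[List[bool]], values: Sequence[float]) -> List[float]:
    """Average the selected values for each query position.

    This mirrors uniform attention over the selected positions.
    Returns 0.0 for a position that selects nothing.
    """
    out = []
    for row in pattern:
        chosen = [v for keep, v in zip(row, values) if keep]
        out.append(sum(chosen) / len(chosen) if chosen else 0.0)
    return out


def frac_prevs(tokens: Sequence[str], target: str) -> List[float]:
    """For each position, the fraction of tokens so far equal to `target`."""
    indices = list(range(len(tokens)))
    # Causal pattern: each position attends to itself and earlier positions.
    causal = select(indices, indices, lambda k, q: k <= q)
    is_target = [1.0 if t == target else 0.0 for t in tokens]
    return aggregate(causal, is_target)


if __name__ == "__main__":
    print(frac_prevs(list("xaxx"), "x"))
```

Each `select` corresponds loosely to an attention pattern and each `aggregate` to an attention head's averaging step, which is why a program written this way has a structure that is known in advance.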

interpretability, tracr, transformer

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.49)