Self-Ablating Transformers: More Interpretability, Less Sparsity

Jeremias Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero Antolin

arXiv.org Artificial Intelligence 

A growing intuition in machine learning suggests a link between sparsity and interpretability. We introduce a novel self-ablation mechanism to investigate this connection ante-hoc in the context of language transformers. Our approach dynamically enforces a k-winner-takes-all constraint, forcing the model to demonstrate selective activation across neuron and attention units. Unlike post-hoc methods that analyze already-trained models, our approach integrates interpretability directly into model training, promoting feature localization from inception. Training small models on the TinyStories dataset and employing interpretability tests, we find that self-ablation leads to more localized circuits, concentrated feature representations, and increased neuron specialization without compromising language modelling performance. Surprisingly, our method also decreased overall sparsity, indicating that self-ablation promotes specialization rather than widespread inactivity. This reveals a complex interplay between sparsity and interpretability, where decreased global sparsity can coexist with increased local specialization, leading to enhanced interpretability. To facilitate reproducibility, we make our code available at https://github.com/keenanpepper/

As machine learning systems are entrusted with increasingly complex tasks, our ability to understand their decision-making processes lags behind their growing capabilities (OpenAI, 2024; Gemini Team, Google, 2024). Much of the current research in interpretability for LLMs focuses on developing post-hoc methods, attempting to explain the behaviour of already-trained models (Ribeiro et al., 2016; Conmy et al., 2023; Foote et al., 2023b; Bills et al., 2023; Huben et al., 2024). While valuable, these approaches often provide only an approximate or incomplete understanding of the underlying mechanisms (Rudin, 2019).
A more fundamental yet less studied approach involves designing models to be inherently more interpretable: an ante-hoc approach in which transparency is woven into the architecture itself (Slavin et al., 2018; Tamkin et al., 2023; Cloud et al., 2024; Liu et al., 2024).
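To make the k-winner-takes-all constraint mentioned above concrete, the following is a minimal, hypothetical sketch of such a gating step: for a vector of unit activations, only the k largest values survive and the rest are zeroed. This is an illustration of the general k-WTA idea, not the paper's actual training-time mechanism, which is applied dynamically across both neuron and attention units.

```python
import numpy as np

def k_winner_takes_all(activations: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest activations per row; zero the rest.

    A simplified, illustrative sketch of a k-winner-takes-all
    constraint; the function name and NumPy formulation are
    assumptions for this example, not the paper's implementation.
    """
    acts = np.atleast_2d(activations)
    # Indices of the k largest values in each row (unordered within the top k).
    top_k = np.argpartition(acts, -k, axis=-1)[:, -k:]
    mask = np.zeros_like(acts, dtype=bool)
    np.put_along_axis(mask, top_k, True, axis=-1)
    # Winners pass through unchanged; all other units are ablated to zero.
    return np.where(mask, acts, 0.0).reshape(activations.shape)

# Example: only the 2 strongest of 5 "neurons" remain active.
x = np.array([0.1, 0.9, -0.3, 0.5, 0.2])
print(k_winner_takes_all(x, k=2))  # -> [0.  0.9 0.  0.5 0. ]
```

Under such a constraint the model cannot spread a computation thinly across many units, which is one intuition for why it might encourage the localized circuits and neuron specialization the paper reports.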