Circuit Distillation
Somin Wadhwa, Silvio Amir, Byron C. Wallace
arXiv.org Artificial Intelligence
Model distillation typically focuses on behavioral mimicry, where a student model is trained to replicate a teacher's output while treating its internal computations as a black box. In this work we propose an alternative approach: distilling the underlying computational mechanisms implemented by a teacher model. Specifically, we propose circuit distillation, which introduces an objective to align internal representations between analogous circuit components in teacher and student models. We propose a method to match "functionally correspondent" circuit components and introduce a loss reflecting similarities between the representations that these induce. We evaluate circuit distillation on entity tracking and theory of mind (ToM) tasks using models from the Llama3 family. Our results demonstrate that circuit distillation outperforms standard distillation, successfully transferring algorithmic capabilities by adjusting only a small, targeted subset of student model parameters. This work establishes the feasibility of transferring mechanisms, which may in turn allow for efficient distillation of targeted teacher capabilities via interpretable and controllable internal student mechanisms.

Model distillation entails training a relatively small and efficient "student" LM using a larger and more capable "teacher" LLM (Gou et al., 2021). The prevailing training paradigm is one of behavioral mimicry: the student model is trained to replicate the output distribution of the teacher model. This is typically done by minimizing the divergence between final-layer logits for the predictive task of interest (Hinton et al., 2015). More recent work has focused on distilling "reasoning" capabilities (Shridhar et al., 2023; Li et al., 2023; Wadhwa et al., 2024).
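To make the contrast between the two objectives concrete, the following is a minimal sketch, not the authors' implementation. `kd_loss` illustrates the standard logit-matching objective in the style of Hinton et al. (2015), using a temperature-softened KL divergence. `alignment_loss` illustrates what a representation-similarity term between matched components could look like; the use of linear CKA as the similarity measure, and all function and parameter names, are assumptions made here for illustration, since the text above does not specify the measure.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    # Numerically stable softmax over the last axis.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    # Behavioral mimicry: KL(teacher || student) on softened output
    # distributions, scaled by T^2 as is conventional.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(kl.mean() * temperature**2)


def linear_cka(x, y):
    # Linear centered kernel alignment between two activation matrices
    # of shape (n_examples, d_x) and (n_examples, d_y). The feature
    # dimensions may differ, as they would for teacher and student.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return float(cross / norm)


def alignment_loss(teacher_acts, student_acts):
    # ASSUMPTION: similarity measured with linear CKA. The loss is 0 when
    # the matched components' representations are maximally similar.
    return 1.0 - linear_cka(teacher_acts, student_acts)
```

In a combined setup one might minimize a weighted sum of the two terms, with the alignment term computed only over the activations of matched teacher and student components; the weighting and the matching procedure are left unspecified here.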
Sep-30-2025