Rethinking Large Language Model Distillation: A Constrained Markov Decision Process Perspective
Zimmer, Matthieu, Ji, Xiaotong, Nguyen, Tu, Ammar, Haitham Bou
arXiv.org Artificial Intelligence
We introduce a novel approach to large language model (LLM) distillation by formulating it as a constrained reinforcement learning problem. While recent work has begun exploring the integration of task-specific rewards into distillation processes, existing methods typically rely on ad-hoc reward weighting. We propose a principled optimization framework that maximizes task-specific rewards while constraining the divergence from the teacher model to remain below a specified threshold. Our approach adapts constrained state-augmented reinforcement learning to the distillation setting, introducing a modified reward function that maintains theoretical guarantees of constraint satisfaction without requiring state augmentation or teacher model access during deployment, and without the computational overhead of dual Lagrangian methods. Through extensive experiments on mathematical reasoning tasks, we demonstrate that our method achieves better constraint satisfaction rates and better reasoning than soft Lagrangian relaxation baselines while maintaining competitive task performance. Our framework provides a theoretically grounded and practically efficient solution for reward-aware distillation in resource-constrained settings.
Large Language Models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks (Vaswani et al., 2017; Trinh et al., 2024; Chervonyi et al., 2025; Guo et al., 2025; Christianos et al., 2023), but their size and complexity make them impractical for deployment in resource-constrained environments. Distillation (Hinton et al., 2015; Czarnecki et al., 2019), a technique where a smaller student model learns from a larger teacher model, has been widely used to transfer knowledge while reducing computational costs.
Conventional distillation methods (Sanh et al., 2020; Gu et al., 2024; Ko et al., 2024) typically focus on minimizing the divergence between the student and teacher models, often using metrics such as Kullback-Leibler (KL) divergence. However, these methods do not fully leverage additional reward signals that can provide valuable guidance, particularly in tasks requiring complex reasoning.
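To make the divergence quantity concrete, the following is a minimal sketch of a per-token KL divergence between student and teacher next-token distributions, together with a check against a divergence threshold of the kind the abstract describes. It is illustrative only and is not the method proposed in the paper: the function names, the direction KL(student || teacher), the averaging over tokens, and the toy logits are all assumptions made for this example.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def sequence_kl(student_logits, teacher_logits):
    """Mean per-token KL(student || teacher) over one sequence.

    Assumption: the KL direction and the mean over tokens are choices
    made for this sketch, not taken from the paper.
    """
    per_token = [
        kl_divergence(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)


def satisfies_constraint(student_logits, teacher_logits, threshold):
    """True if the divergence from the teacher stays within the threshold."""
    return sequence_kl(student_logits, teacher_logits) <= threshold


if __name__ == "__main__":
    # Hypothetical logits for a two-token sequence over a three-word vocabulary.
    student = [[2.0, 0.5, 0.1], [0.3, 1.7, 0.2]]
    teacher = [[2.2, 0.4, 0.0], [0.1, 2.0, 0.3]]
    print("mean KL:", sequence_kl(student, teacher))
    print("within threshold 0.05:", satisfies_constraint(student, teacher, 0.05))
```

Conventional distillation would minimize a quantity like `sequence_kl` directly, whereas the constrained view treats it as a budget: the student is free to pursue task reward as long as a check like `satisfies_constraint` continues to hold.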
Sep-30-2025