Online Knowledge Distillation with Reward Guidance
This work studies knowledge distillation (KD) for large language models (LLMs) through preference optimization. We propose a reward-guided imitation learning framework for sequential KD, formulating a min-max optimization problem between the policy and reward model (RM) to minimize the performance gap between the student and teacher policies. Specifically, the reward optimization is constrained to achieve near-optimality within a confidence set for preference alignment. For preference data construction, we explore both offline and online preference-based KD. Additionally, we reformulate the RM using the $Q$-value function and extend the framework to white-box KD, where the teacher policy's predicted probabilities are accessible. Theoretical analysis and empirical results demonstrate the effectiveness of the proposed framework.
arXiv.org Artificial Intelligence
May-27-2025
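The abstract does not state the objective explicitly; one plausible reading of the min-max formulation, under assumed notation ($\pi$ the student policy, $\pi_T$ the teacher policy, $\rho$ the prompt distribution, and $\mathcal{R}(\mathcal{D})$ the confidence set of reward models consistent with the preference data $\mathcal{D}$; none of this notation is taken from the paper), is:

```latex
\[
\min_{\pi} \;\; \max_{r \,\in\, \mathcal{R}(\mathcal{D})} \;\;
\mathbb{E}_{x \sim \rho,\; y \sim \pi_T(\cdot \mid x)}\!\bigl[ r(x, y) \bigr]
\;-\;
\mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x)}\!\bigl[ r(x, y) \bigr]
\]
```

Read this way, the inner maximization picks the reward in the confidence set that most separates teacher from student, so the outer minimization drives down the worst-case performance gap between the two policies.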
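For the online preference-based KD variant, here is a minimal sketch of how preference pairs might be constructed on the fly: sample one response from the teacher and one from the student per prompt, then let the reward model label the pair. All names (`teacher_policy`, `student_policy`, `reward`, `build_online_preferences`) are hypothetical stand-ins, not the paper's API, and the interleaved RM/student updates are omitted.

```python
# Hypothetical sketch of online preference-pair construction for
# preference-based KD. Not the paper's implementation.
import random


def teacher_policy(prompt: str) -> str:
    # Stand-in for the (stronger) teacher LLM's sampled response.
    return prompt + " -> teacher answer"


def student_policy(prompt: str) -> str:
    # Stand-in for the student LLM being distilled.
    return prompt + " -> student answer"


def reward(prompt: str, response: str) -> float:
    # Stand-in reward model; in the framework this would be optimized
    # within a confidence set for preference alignment.
    return random.random()


def build_online_preferences(prompts):
    """For each prompt, sample a teacher and a student response, then
    label the higher-reward response as chosen and the other as rejected."""
    pairs = []
    for x in prompts:
        y_teacher = teacher_policy(x)
        y_student = student_policy(x)
        if reward(x, y_teacher) >= reward(x, y_student):
            chosen, rejected = y_teacher, y_student
        else:
            chosen, rejected = y_student, y_teacher
        pairs.append({"prompt": x, "chosen": chosen, "rejected": rejected})
    return pairs


if __name__ == "__main__":
    # In an online scheme, each batch of fresh pairs would feed the next
    # RM and student updates; those training steps are omitted here.
    for pair in build_online_preferences(["What is KD?", "Define RLHF."]):
        print(pair)
```

The offline variant would differ only in that the pairs come from a fixed, pre-collected dataset rather than from the current policies.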
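For the white-box extension, the $Q$-value reformulation of the RM is plausibly the standard soft-$Q$ identity: under a KL-regularized objective with temperature $\beta$, the optimal policy is the softmax of the $Q$-values, so access to the teacher's predicted probabilities determines a token-level $Q$ up to a state-dependent constant $c(s)$. This is an assumption about the construction, not a statement of the paper's exact result:

```latex
\[
\pi(a \mid s) \;=\;
\frac{\exp\bigl( Q(s, a) / \beta \bigr)}{\sum_{a'} \exp\bigl( Q(s, a') / \beta \bigr)}
\quad\Longrightarrow\quad
Q_T(s, a) \;=\; \beta \log \pi_T(a \mid s) \;+\; c(s).
\]
```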