Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs
Anshumann, null, Zaidi, Mohd Abbas, Kedia, Akhil, Ahn, Jinwoo, Kwon, Taehwak, Lee, Kangwook, Lee, Haejun, Lee, Joohyung
–arXiv.org Artificial Intelligence
Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method `Random Sampling Knowledge Distillation', which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.
arXiv.org Artificial Intelligence
Mar-21-2025
- Country:
- Oceania > Australia (0.04)
- North America
- United States
- Texas (0.04)
- Washington > King County
- Seattle (0.04)
- New York > New York County
- New York City (0.04)
- Nevada > Clark County
- Las Vegas (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Florida > Miami-Dade County
- Miami (0.04)
- California
- San Francisco County > San Francisco (0.14)
- Los Angeles County > Long Beach (0.04)
- Canada
- Ontario > Toronto (0.04)
- Quebec > Montreal (0.04)
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- United States
- Europe
- Asia
- Africa
- Rwanda > Kigali
- Kigali (0.04)
- Ethiopia > Addis Ababa
- Addis Ababa (0.04)
- Rwanda > Kigali
- Genre:
- Research Report (0.81)
- Industry:
- Education (0.49)
- Technology: