Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs

Anshumann, Zaidi, Mohd Abbas, Kedia, Akhil, Ahn, Jinwoo, Kwon, Taehwak, Lee, Kangwook, Lee, Haejun, Lee, Joohyung

Mar-21-2025–arXiv.org Artificial Intelligence

Knowledge distillation can be a cost-effective technique for distilling knowledge in Large Language Models if the teacher's output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches to sparse knowledge distillation, such as caching Top-K probabilities, while intuitive, provide biased estimates of the teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method, 'Random Sampling Knowledge Distillation', which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy-based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B parameters.
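The bias claim in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes the simplest importance-sampling setup, in which token ids are drawn with replacement from the teacher distribution itself and the cached target is the normalized sample count. The function names, vocabulary size, and sample budget are all hypothetical choices made for illustration.

```python
import numpy as np

def topk_target(p, k):
    """Keep the k largest teacher probabilities and renormalize them."""
    idx = np.argsort(p)[-k:]
    q = np.zeros_like(p)
    q[idx] = p[idx] / p[idx].sum()
    return q

def sampled_target(p, n, rng):
    """Draw n token ids from p with replacement and store counts / n.

    Assumption: the proposal distribution equals the teacher distribution,
    so each importance weight reduces to 1 / n.
    """
    ids = rng.choice(len(p), size=n, p=p)
    counts = np.bincount(ids, minlength=len(p))
    return counts / n

rng = np.random.default_rng(0)

# Synthetic teacher distribution over a toy vocabulary (illustrative only).
vocab = 1000
logits = rng.normal(size=vocab) * 2.0
p = np.exp(logits - logits.max())
p /= p.sum()

k = 12  # number of entries cached per position in both schemes

# Top-K yields the same truncated target every time, so its error never averages out.
topk = topk_target(p, k)

# Averaging many independent sampled targets approximates their expectation.
avg_sampled = np.mean([sampled_target(p, k, rng) for _ in range(5000)], axis=0)

topk_err = np.abs(topk - p).sum()
sampled_err = np.abs(avg_sampled - p).sum()
print(f"L1 error of Top-K target:          {topk_err:.3f}")
print(f"L1 error of mean sampled target:   {sampled_err:.3f}")
```

Each sampled target has at most `k` non-zero entries, so it is as sparse as the Top-K target, yet its average converges to the full teacher distribution. Because the gradient of a cross-entropy loss with respect to the student logits is the student softmax minus the target, a target that is unbiased also leaves that gradient unbiased in expectation.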




