PLD: A Choice-Theoretic List-Wise Knowledge Distillation

Jun-22-2026, 17:28:00 GMT–Neural Information Processing Systems

Knowledge distillation is a model compression technique in which a compact "student" network is trained to replicate the predictive behavior of a larger "teacher" network. In logit-based knowledge distillation, the de facto approach is to augment cross-entropy with a distillation term. Typically, this term is either a KL divergence that matches marginal probabilities or a correlation-based loss that captures intra- and inter-class relationships. In every case, it is added on top of cross-entropy and carries its own weight, which must be carefully tuned. In this paper, we adopt a choice-theoretic perspective and recast knowledge distillation under the Plackett-Luce model by interpreting teacher logits as "worth" scores. We introduce Plackett-Luce Distillation (PLD), a weighted list-wise ranking loss in which the teacher model transfers knowledge of its full ranking of classes, weighting each ranked choice by its own confidence.
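To make the idea of a weighted list-wise ranking loss concrete, here is a minimal single-example sketch in plain Python. The abstract does not give the exact formulation, so several details are assumptions for illustration only: the teacher's ranking is taken to be its logits sorted in descending order, each ranked choice is weighted by the teacher's softmax probability for that class, and the function name `pld_loss_sketch` is hypothetical. The paper's actual loss may differ, for example in how the ground-truth label enters the ranking.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def pld_loss_sketch(student_logits, teacher_logits):
    """Weighted Plackett-Luce negative log-likelihood for one example.

    Assumptions (not confirmed by the abstract):
      * the ranking is the teacher's classes in descending logit order;
      * the weight of each ranked choice is the teacher's softmax
        probability for the class chosen at that position.
    """
    # Teacher ranking: class indices from most to least preferred.
    order = sorted(range(len(teacher_logits)), key=lambda i: -teacher_logits[i])
    # Teacher confidence per class, used as the per-position weight.
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits)]

    loss = 0.0
    for k, cls in enumerate(order):
        # Plackett-Luce: the k-th choice is made among the classes not yet
        # chosen, with the student's logits acting as "worth" scores.
        remaining = [student_logits[j] for j in order[k:]]
        log_prob_k = log_softmax(remaining)[0]
        loss -= teacher_probs[cls] * log_prob_k
    return loss


if __name__ == "__main__":
    teacher = [3.0, 1.0, 0.2]
    print(pld_loss_sketch([3.0, 1.0, 0.2], teacher))  # student agrees with the teacher ranking
    print(pld_loss_sketch([0.2, 1.0, 3.0], teacher))  # student reverses the teacher ranking
```

Under these assumptions, a student whose logits reverse the teacher's ordering incurs a larger loss than one that reproduces it, and the top-ranked choices dominate because they carry the largest teacher weights.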

distillation, machine learning, natural language, (17 more...)

Genre:
- Research Report > Experimental Study (0.46)

Industry:
- Education (0.68)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (0.67)
  - Artificial Intelligence
    - Vision (0.95)
    - Natural Language (0.93)
    - Representation & Reasoning (0.67)
    - Machine Learning > Neural Networks
      - Deep Learning (0.68)
