Token-Scaled Logit Distillation for Ternary Weight Generative Language Models
Neural Information Processing Systems
Generative Language Models (GLMs) have shown impressive performance in tasks such as text generation, understanding, and reasoning. However, their large model sizes pose challenges for practical deployment. To address this, Quantization-Aware Training (QAT) has become increasingly popular; however, existing QAT methods for generative models incur a noticeable loss of accuracy. To counteract this issue, we propose a novel knowledge distillation method designed specifically for GLMs. Our method, token-scaled logit distillation, mitigates overfitting and enables superior learning from both the teacher model and the ground truth.
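The core idea named in the abstract is a logit-distillation loss whose per-token contribution is rescaled. The exact scaling rule is not given here, so the sketch below makes an illustrative assumption: each token's KL term is weighted by the teacher's per-token confidence (one minus normalized prediction entropy). The function name `token_scaled_kd_loss` and the entropy-based weighting are assumptions for illustration, not the paper's definitive formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_scaled_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Per-token KL(teacher || student), each token weighted by teacher
    confidence. logits: arrays of shape (seq_len, vocab_size).
    The confidence-based weighting is an illustrative assumption."""
    p_t = softmax(teacher_logits / temperature)
    p_s = softmax(student_logits / temperature)
    # per-token KL divergence, shape (seq_len,)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    # token weight: low teacher entropy (confident prediction) -> weight near 1
    ent = -(p_t * np.log(p_t + 1e-12)).sum(axis=-1)
    weight = 1.0 - ent / np.log(p_t.shape[-1])
    return float((weight * kl).mean())
```

In a QAT loop, this loss would be added to the standard cross-entropy against ground-truth tokens; when student and teacher logits coincide, the loss is zero.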
Keywords: generative language model, ternary weight generative language model, token-scaled logit distillation
Jan-19-2025, 12:09:43 GMT