MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
–Neural Information Processing Systems
The small model (student) is trained to deeply mimic the self-attention module of the large model (teacher), because self-attention plays a vital role in Transformer networks.
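As a rough illustration of what "mimicking the self-attention module" could look like, the sketch below compares teacher and student attention distributions with a KL divergence. This is a minimal sketch under stated assumptions, not the paper's exact objective: the choice of KL divergence, the averaging over query positions, and the helper names `attention_distribution` and `attention_kl_loss` are all hypothetical and chosen for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_distribution(queries, keys):
    """Scaled dot-product attention weights (one row per query position)."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def attention_kl_loss(teacher_attn, student_attn, eps=1e-12):
    """Assumed loss: KL(teacher || student), averaged over query positions."""
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        total += sum(t * math.log((t + eps) / (s + eps)) for t, s in zip(t_row, s_row))
    return total / len(teacher_attn)


# Toy example: the teacher uses 4-dimensional queries/keys, the student 2-dimensional.
# Both attend over the same 2 tokens, so the attention matrices have the same shape.
teacher = attention_distribution(
    [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5]],
    [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5]],
)
student = attention_distribution([[0.3, 0.1], [0.2, 0.4]], [[0.3, 0.1], [0.2, 0.4]])
loss = attention_kl_loss(teacher, student)
```

Because attention weights are defined over token positions rather than hidden dimensions, a loss of this kind does not require the student to share the teacher's hidden size; that is the design point the toy example is meant to show.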
Oct-2-2025, 18:08:19 GMT