MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Neural Information Processing Systems 

The small model (student) is trained by deeply mimicking the self-attention module of the large model (teacher), a component that plays a vital role in Transformer networks.
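To make the idea concrete, below is a minimal PyTorch sketch of one way such attention mimicry can be implemented: the student's self-attention distributions are pulled toward the teacher's with a KL-divergence loss. The function name, shapes, and reduction are illustrative assumptions, not the authors' released code; the full method also transfers relations among value vectors, which this sketch omits.

```python
# Minimal sketch (not the authors' released code) of a self-attention
# distillation loss: the student's attention distributions are pushed
# toward the teacher's via KL divergence. The helper name
# `attention_distillation_loss` is hypothetical.
import torch


def attention_distillation_loss(teacher_attn: torch.Tensor,
                                student_attn: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over attention maps.

    Both tensors hold softmax-normalized attention distributions of
    shape (batch, heads, seq_len, seq_len); each row over the last
    dimension sums to 1.
    """
    eps = 1e-12  # guard against log(0)
    kl = teacher_attn * (torch.log(teacher_attn + eps) -
                         torch.log(student_attn + eps))
    # Sum over key positions (one full distribution per query position),
    # then average over batch, heads, and queries.
    return kl.sum(dim=-1).mean()


if __name__ == "__main__":
    # Toy usage: random attention maps standing in for the last
    # Transformer layer's teacher and student attention.
    batch, heads, seq_len = 2, 12, 16
    teacher_attn = torch.randn(batch, heads, seq_len, seq_len).softmax(dim=-1)
    student_attn = torch.randn(batch, heads, seq_len, seq_len).softmax(dim=-1)
    print(attention_distillation_loss(teacher_attn, student_attn))
```

Because the loss is computed on attention distributions rather than hidden states, it does not constrain the student's hidden size, which is what makes this style of distillation task-agnostic with respect to the student architecture.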