Deep Compression of Pre-trained Transformer Models

Neural Information Processing Systems 

Owing to their computational efficiency and scalability, transformer models can be trained on exceedingly large amounts of data, at the expense of tremendous growth in model size.