ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

Neural Information Processing Systems 

Efficiently serving ever-larger trained natural language models in practice has become exceptionally challenging, even for powerful cloud servers, due to their prohibitive memory and computation requirements.