LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Tim Dettmers, Mike Lewis, Luke Zettlemoyer
Neural Information Processing Systems
Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8().
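As a rough illustration of the vector-wise half of such a procedure, the sketch below quantizes each row of the activations and each column of the weights with its own absmax scaling constant before an integer matrix multiply, then divides both constants back out. This is a minimal PyTorch sketch under our own assumptions: the function name is ours, the Int8 GEMM is emulated in int32 rather than run through dedicated Int8 kernels, and the mixed-precision decomposition for emergent outlier features that LLM.int8() adds on top is omitted.

import torch

def vectorwise_int8_matmul(X: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Sketch of vector-wise absmax Int8 matmul approximating X @ W.

    X: (n, k) float16 activations; W: (k, m) float16 weights.
    Each row of X and each column of W gets its own scaling constant,
    so the outer product of the constants rescales every output element.
    """
    # Per-row absmax scale for X and per-column absmax scale for W,
    # computed in float32 to avoid float16 underflow in the clamp.
    cx = X.float().abs().amax(dim=1, keepdim=True).clamp(min=1e-8)  # (n, 1)
    cw = W.float().abs().amax(dim=0, keepdim=True).clamp(min=1e-8)  # (1, m)

    # Quantize to signed 8-bit integers in [-127, 127] with round-to-nearest.
    Xi8 = torch.round(127.0 * X.float() / cx).to(torch.int8)
    Wi8 = torch.round(127.0 * W.float() / cw).to(torch.int8)

    # Integer GEMM accumulated in Int32 (emulated here; real deployments
    # would use a hardware Int8 matrix-multiply kernel).
    acc = Xi8.to(torch.int32) @ Wi8.to(torch.int32)

    # Dequantize: undo both scaling constants and the two factors of 127.
    return (acc.to(torch.float32) * (cx * cw) / (127.0 * 127.0)).half()

# Usage: for well-behaved (outlier-free) inputs, the result is close to X @ W.
X = torch.randn(4, 8, dtype=torch.float16)
W = torch.randn(8, 16, dtype=torch.float16)
out = vectorwise_int8_matmul(X, W)

The point of the per-row and per-column constants is that a single outlier no longer forces the quantization range of the entire tensor, only of its own row or column; handling the remaining extreme outlier dimensions is what the second, mixed-precision part of the procedure is for.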