LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Tim Dettmers, Mike Lewis, Luke Zettlemoyer
Neural Information Processing Systems
Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8().
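As a rough illustration of the vector-wise half of such a procedure, the sketch below quantizes each row of the activations and each column of the weights with its own absmax scaling constant before an integer matrix multiply, then divides both constants back out. This is a minimal PyTorch sketch under our own assumptions: the function name is ours, the Int8 GEMM is emulated in int32 rather than run through dedicated Int8 kernels, and the mixed-precision decomposition for emergent outlier features that LLM.int8() adds on top is omitted.

import torch

def vectorwise_int8_matmul(X: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Sketch of vector-wise absmax Int8 matmul approximating X @ W.

    X: (n, k) float16 activations; W: (k, m) float16 weights.
    Each row of X and each column of W gets its own scaling constant,
    so the outer product of the constants rescales every output element.
    """
    # Per-row absmax scale for X and per-column absmax scale for W,
    # computed in float32 to avoid float16 underflow in the clamp.
    cx = X.float().abs().amax(dim=1, keepdim=True).clamp(min=1e-8)  # (n, 1)
    cw = W.float().abs().amax(dim=0, keepdim=True).clamp(min=1e-8)  # (1, m)

    # Quantize to signed 8-bit integers in [-127, 127] with round-to-nearest.
    Xi8 = torch.round(127.0 * X.float() / cx).to(torch.int8)
    Wi8 = torch.round(127.0 * W.float() / cw).to(torch.int8)

    # Integer GEMM accumulated in Int32 (emulated here; real deployments
    # would use a hardware Int8 matrix-multiply kernel).
    acc = Xi8.to(torch.int32) @ Wi8.to(torch.int32)

    # Dequantize: undo both scaling constants and the two factors of 127.
    return (acc.to(torch.float32) * (cx * cw) / (127.0 * 127.0)).half()

# Usage: for well-behaved (outlier-free) inputs, the result is close to X @ W.
X = torch.randn(4, 8, dtype=torch.float16)
W = torch.randn(8, 16, dtype=torch.float16)
out = vectorwise_int8_matmul(X, W)

The point of the per-row and per-column constants is that a single outlier no longer forces the quantization range of the entire tensor, only of its own row or column; handling the remaining extreme outlier dimensions is what the second, mixed-precision part of the procedure is for.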