LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models

Park, Gunho, Park, Baeseong, Kim, Minsub, Lee, Sungjae, Kim, Jeonghoon, Kwon, Beomseok, Kwon, Se Jung, Kim, Byeongwook, Lee, Youngjoo, Lee, Dongsoo

Apr-15-2023–arXiv.org Artificial Intelligence

The recent advancements in self-supervised learning, combined with the Transformer architecture, have enabled natural language processing (NLP) to achieve remarkably low perplexity. However, powerful NLP models necessitate increasing model size, leading to substantial computational and memory requirements. In this paper, we introduce an efficient inference framework tailored for large-scale generative language models. To reduce the model size, we employ a weight-only quantization strategy while preserving full precision for activations. As a result, we attain sub-4-bit quantization for each weight through non-uniform or uniform quantization techniques. Our proposed kernel, called LUT-GEMM, then accelerates quantized matrix multiplications, offering a flexible balance between compression ratio and accuracy. Unlike earlier matrix multiplication kernels that accommodated weight-only quantization, LUT-GEMM efficiently eliminates the resource-demanding dequantization process for both uniform and non-uniform quantization methods. By reducing the latency of individual GPUs and the overall inference process for large-scale language models, LUT-GEMM provides significant performance improvements in inference. The impact of LUT-GEMM is facilitated by implementing high compression ratios through low-bit quantization and efficient LUT-based operations, which decreases the number of required GPUs. For the OPT-175B model with 3-bit quantization, we show that LUT-GEMM accelerates the latency for generating each token by 2.1x compared to OPTQ, which requires costly dequantization. Consequently, LUT-GEMM enables inference of the OPT-175B model on a single GPU without noticeable degradation in accuracy or performance, while the non-quantized OPT-175B model requires a minimum of 8 GPUs.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

Apr-15-2023

arXiv.org PDF

Add feedback

Country:
- Asia (0.28)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language
    - Chatbot (0.93)
    - Large Language Model (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found