TP-Aware Dequantization

Adnan Hoque, Mudhakar Srivatsa, Chih-Chieh Yang, Raghu Ganti

arXiv.org Artificial Intelligence 

Given the recent advancement of LLMs, deployment optimizations are becoming increasingly crucial as state-of-the-art LLMs continue to grow in scale. As these models grow, so does the need to optimize the increasingly parallel and distributed workload requirements of modern-day deep learning inference. Strategies like GPTQ [1] and Tensor Parallel (TP) [4] are hence essential for achieving high-throughput performance. Our method is motivated by several key properties of GPTQ, TP, and General Matrix Multiplication (GEMM). We build on these existing methods and present a key innovation that helps maximize memory throughput and reduce latency. Our method shows up to a 1.81x speedup on Llama-70B and up to a 1.78x speedup on Granite-20B MLP layer problem sizes. We achieve this by reducing global communication and enforcing data locality.
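To make the general setting concrete, the sketch below is a minimal, illustrative simulation (not the paper's kernel) of combining group-quantized GPTQ-style weights with column-parallel TP: each simulated rank dequantizes only its own weight shard and runs a local GEMM, so only the small activation outputs ever need to be combined. The shapes, group size, and helper names here are assumptions chosen for illustration.

```python
# Illustrative sketch only: group-quantized weight sharded column-parallel,
# with dequantization kept local to each simulated "rank".
import numpy as np

def dequantize(q, scales, zeros, group_size):
    """Dequantize integer weights with per-group scales/zero-points (GPTQ-style)."""
    out = np.empty(q.shape, dtype=np.float32)
    for g in range(0, q.shape[0], group_size):
        idx = g // group_size
        out[g:g + group_size] = (q[g:g + group_size] - zeros[idx]) * scales[idx]
    return out

rng = np.random.default_rng(0)
K, N, group_size, world_size = 128, 64, 32, 4  # assumed toy problem size

# Fake int4-range quantized weight plus per-group scale and zero-point.
q_weight = rng.integers(0, 16, size=(K, N)).astype(np.int32)
scales = rng.random((K // group_size, N)).astype(np.float32)
zeros = np.full((K // group_size, N), 8, dtype=np.float32)
x = rng.random((1, K)).astype(np.float32)

# Column-parallel TP: each rank owns a slice of the output dimension and
# dequantizes only that shard -- the full fp weight is never materialized
# on any one rank, and no dequantized weights cross rank boundaries.
partials = []
for rank_cols in np.array_split(np.arange(N), world_size):
    w_shard = dequantize(q_weight[:, rank_cols],
                         scales[:, rank_cols],
                         zeros[:, rank_cols],
                         group_size)
    partials.append(x @ w_shard)          # local GEMM on the shard

y = np.concatenate(partials, axis=1)       # only activations are combined
```

In a real deployment the per-rank loop would be replaced by actual TP ranks and the final concatenation by a collective over activations; the point of the sketch is simply that keeping dequantization local trades weight communication for a much smaller activation exchange.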