NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention

Zhang, Tianyi, Yi, Jonah Wonkyu, Yao, Bowen, Xu, Zhaozhuo, Shrivastava, Anshumali

Mar-2-2024–arXiv.org Artificial Intelligence

Large language model (LLM) inference on Central Processing Units (CPUs) is challenging due to the vast quantities of expensive Multiply-Add (MAD) matrix operations in the attention computations. In this paper, we argue that there is a rare gem in modern CPUs, Single-Instruction-Multiple-Data (SIMD) registers, which allow for ultra-low-latency lookups in batch. We leverage this unique capability of CPUs to propose NoMAD-Attention, an efficient attention algorithm that replaces MAD operations with in-register lookups. Through hardware-aware algorithmic designs, NoMAD-Attention computes attention scores using repeated fast accesses to SIMD registers despite the registers' highly limited size. Moreover, NoMAD-Attention works with pre-trained attention-based LLMs without model finetuning. Empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well, and speeds up the 4-bit quantized LLaMA-7B-based model by up to 2$\times$ at 16k context length. Our results are reproducible at https://github.com/tonyzhang617/nomad-dist.
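
The abstract states only that MAD operations are replaced with lookups. As a rough illustration of how a query-key dot product can be turned into table lookups, the sketch below assumes a product-quantization-style scheme: each key is split into sub-vectors, each sub-vector is replaced by the index of its nearest centroid, and a small per-query table of query-centroid dot products is then indexed by those codes. The function names, the sizes (8 dimensions, 4 sub-vectors, 16 centroids) and the random codebooks are hypothetical choices for illustration, and plain Python lists stand in for SIMD registers; this is not the authors' implementation.

```python
import random

def split(vec, num_sub):
    """Split a vector into num_sub contiguous sub-vectors of equal length."""
    step = len(vec) // num_sub
    return [vec[i * step:(i + 1) * step] for i in range(num_sub)]

def dot(a, b):
    """Ordinary multiply-add dot product, used here for reference and table building."""
    return sum(x * y for x, y in zip(a, b))

def encode_key(key, codebooks):
    """Replace each key sub-vector by the index of its nearest centroid."""
    codes = []
    for sub, centroids in zip(split(key, len(codebooks)), codebooks):
        dists = [sum((s - c) ** 2 for s, c in zip(sub, cen)) for cen in centroids]
        codes.append(dists.index(min(dists)))
    return codes

def build_lut(query, codebooks):
    """Per-query table: lut[s][c] = dot(query sub-vector s, centroid c)."""
    return [[dot(sub, cen) for cen in centroids]
            for sub, centroids in zip(split(query, len(codebooks)), codebooks)]

def lookup_score(codes, lut):
    """Approximate dot(query, key) using only table lookups and additions."""
    return sum(lut[s][c] for s, c in enumerate(codes))

# Hypothetical sizes: 16 centroids per sub-vector means each code fits in 4 bits.
rng = random.Random(0)
DIM, NUM_SUB, NUM_CENTROIDS = 8, 4, 16
codebooks = [[[rng.uniform(-1, 1) for _ in range(DIM // NUM_SUB)]
              for _ in range(NUM_CENTROIDS)] for _ in range(NUM_SUB)]
query = [rng.uniform(-1, 1) for _ in range(DIM)]
key = [rng.uniform(-1, 1) for _ in range(DIM)]

codes = encode_key(key, codebooks)   # done once, when the key is stored
lut = build_lut(query, codebooks)    # done once per query
approx = lookup_score(codes, lut)    # per key: lookups and additions only
exact = dot(query, key)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

In this sketch the multiplications are confined to building the table once per query, so scoring each stored key needs only lookups and additions. With random, untrained codebooks the approximation is coarse; the score is exact only when every key sub-vector coincides with a centroid.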

computation, efficient llm inference, nomad-attention

Country:
- North America > United States
  - Texas > Harris County
    - Houston (0.04)
  - New Jersey > Hudson County
    - Hoboken (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.94)
