MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji
–arXiv.org Artificial Intelligence
Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach that extends KV sharing across transformer layers to reduce memory usage beyond what is possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, shrinking the KV cache by up to a factor of 6x compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.
Figure 1: Simplified overview of current KV sharing methods: vanilla MHA (top left), MQA (bottom left), and GQA (top right). All of them share KV heads within the same layer. Our proposed KV sharing scheme, MLKV (bottom right), shares KV heads between layers.
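The core idea lends itself to a compact sketch: rather than every decoder layer projecting and caching its own K/V heads, layers are grouped and only one layer per group produces the K/V tensors that the remaining layers reuse. The PyTorch sketch below illustrates this under stated assumptions; the class name, the `owns_kv` flag, and the grouping loop are illustrative choices, not the authors' implementation, and within-layer head reduction (MQA/GQA) is omitted for brevity.

```python
# Minimal sketch (not the authors' code) of cross-layer KV sharing in the spirit
# of MLKV: only "KV-owner" layers project K/V; the other layers in the group
# reuse those tensors, so only the owners' K/V would need to be cached.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, owns_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        self.owns_kv = owns_kv
        if owns_kv:
            # Only owner layers carry K/V projections (and would hold KV cache).
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        if self.owns_kv:
            k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            shared_kv = (k, v)          # this is what would be KV-cached
        else:
            k, v = shared_kv            # reuse K/V produced by an earlier layer
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), shared_kv

# Example stack: 12 layers sharing K/V in groups of 6 -> only 2 layers cache K/V.
layers = nn.ModuleList(
    [SharedKVAttention(d_model=768, n_heads=12, owns_kv=(i % 6 == 0))
     for i in range(12)]
)

def run_stack(x):
    kv = None
    for layer in layers:
        attn_out, kv = layer(x, kv)
        x = x + attn_out  # residual connection; MLP blocks omitted for brevity
    return x
```

With a group size of g, only one in every g layers stores K/V, so the per-token cache footprint shrinks by roughly a factor of g on top of whatever within-layer sharing (MQA or GQA) is already applied.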
Jun-15-2024