KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing

Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen

arXiv.org Artificial Intelligence 

The development of large language models (LLMs) has significantly expanded model sizes, resulting in substantial GPU memory requirements during inference. Most existing KV cache compression methods focus on intra-layer compression within a single Transformer layer, but few works consider layer-wise compression. In this paper, we propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression. Rather than intuitively sharing based on higher similarity, we discover a counterintuitive phenomenon: sharing dissimilar KV caches better preserves model performance. Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption without significantly impacting model performance, and it can also achieve at least 1.3 times generation acceleration.

Figure 1: Previous methods primarily focus on discarding Keys and Values within layers. In contrast, we share KV caches across layers based on their dissimilarity.

Although the KV cache greatly helps improve inference speed, it also significantly increases GPU memory consumption. During the LLM inference phase, the KV cache typically accounts for a substantial portion of GPU memory usage. Recent research has seen a proliferation of methods aimed at compressing KV caches to reduce memory consumption (Zandieh et al., 2024; Xu et al., 2024; Yang et al., 2024b; Zhang et al., 2024a;b; Dong et al., 2024). However, these efforts have predominantly focused on intra-layer KV cache compression within individual Transformer layers of LLMs.
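To make the layer-wise sharing idea concrete, the following is a minimal sketch of how dissimilarity-based KV cache sharing could be selected offline. The function names, the use of cosine similarity over flattened per-layer calibration caches, and the greedy budget-based selection are illustrative assumptions; the excerpt above does not specify the paper's actual calibration and search procedure.

```python
import torch
import torch.nn.functional as F

def rank_layer_pairs_by_dissimilarity(layer_caches):
    """Rank layer pairs from most to least dissimilar KV cache.

    layer_caches: one tensor per layer, e.g. the layer's concatenated
    key/value states averaged over a small calibration set.
    Cosine similarity is an illustrative choice of metric.
    """
    flat = [c.flatten().float() for c in layer_caches]
    pairs = []
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            sim = F.cosine_similarity(flat[i], flat[j], dim=0).item()
            pairs.append((sim, i, j))
    pairs.sort(key=lambda p: p[0])  # lowest similarity (most dissimilar) first
    return [(i, j) for _, i, j in pairs]

def build_sharing_map(ranked_pairs, num_shared):
    """Greedily pick `num_shared` layers that will reuse another layer's cache.

    Returns {target_layer: source_layer}: at inference time, target_layer
    stores no KV cache of its own and reuses source_layer's keys and values.
    """
    sharing, sources = {}, set()
    for i, j in ranked_pairs:
        if len(sharing) == num_shared:
            break
        # Keep the mapping acyclic: a source layer never gives up its own
        # cache, and a layer that already shares is not reused as a source.
        if i not in sharing and i not in sources and j not in sharing:
            sharing[i] = j
            sources.add(j)
    return sharing
```

As a rough usage note under these assumptions, a 32-layer model with a budget of about 10 shared layers would correspond to roughly the 30% KV cache reduction reported above; the selected layers would then reuse the cached keys and values of their mapped source layers during generation.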