Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference
Yue Zhu, Hao Yu, Chen Wang, Zhuoran Liu, Eun Kyung Lee
arXiv.org Artificial Intelligence
The increasing adoption of large language models (LLMs) with extended context windows necessitates efficient Key-Value Cache (KVC) management to optimize inference performance. We analyze real-world KVC access patterns using publicly available traces and evaluate commercial key-value stores like Redis and state-of-the-art RDMA-based systems (CHIME [1] and Sherman [2]) for KVC metadata management. Our work demonstrates the lack of a tailored storage solution for KVC prefilling, underscores the need for an efficient distributed caching system with optimized metadata management for LLM workloads, and provides insights into designing improved KVC management systems for scalable, low-latency inference.

Large Language Models (LLMs) have shown remarkable ability in tasks such as text generation, translation, and question answering, but their attention architecture introduces significant challenges. The use of key-value caches (KVC) in the attention layers of transformer models, while essential for efficient token generation, requires substantial memory resources.
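To make the role of the KVC concrete, here is a minimal sketch of single-head attention decoding with a KV cache, in PyTorch-style Python. All names and shapes are illustrative assumptions, not taken from the paper: the point is only that each step projects and caches the new token's key and value rather than recomputing them for the whole prefix.

```python
import torch

def attend_with_kv_cache(x_new, W_q, W_k, W_v, cache):
    """One decoding step of single-head attention with a KV cache.

    x_new : (batch, 1, d_model) embedding of the newly generated token.
    cache : dict with 'k' and 'v' tensors of shape (batch, t, d_head),
            holding projections of all t previously processed tokens
            (initialize with zero-length tensors, e.g. torch.zeros(batch, 0, d_head)).
    """
    q = x_new @ W_q                       # (batch, 1, d_head)
    k_new = x_new @ W_k                   # project only the new token
    v_new = x_new @ W_v

    # Append to the cache instead of recomputing K/V for the entire prefix.
    cache['k'] = torch.cat([cache['k'], k_new], dim=1)  # (batch, t+1, d_head)
    cache['v'] = torch.cat([cache['v'], v_new], dim=1)

    d_head = q.shape[-1]
    scores = q @ cache['k'].transpose(-2, -1) / d_head ** 0.5  # (batch, 1, t+1)
    weights = torch.softmax(scores, dim=-1)
    return weights @ cache['v'], cache    # attended output: (batch, 1, d_head)
```

Because the cached tensors grow linearly with context length, per layer and per head, long-context serving quickly exhausts GPU memory; this is the pressure that motivates offloading KVC blocks to external storage in the first place.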
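The paper evaluates Redis for KVC metadata management. As a rough illustration of what such a lookup involves, the sketch below assumes a hypothetical scheme that keys cached prefix blocks by a hash of their token-ID prefix; the key format, field names, and helper functions are invented for illustration and are not the schema used in the evaluation.

```python
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)

def prefix_key(token_ids):
    # Hypothetical key scheme: hash the token-ID prefix to name its KVC block.
    digest = hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()
    return f"kvc:prefix:{digest}"

def lookup_prefix(token_ids):
    # Return metadata for a previously prefilled prefix, or None on a miss.
    meta = r.hgetall(prefix_key(token_ids))
    return meta or None

def register_prefix(token_ids, node, offset, length):
    # Record where a newly prefilled KV block lives (illustrative fields).
    r.hset(prefix_key(token_ids), mapping={
        "node": node, "offset": offset, "length": length,
    })
```

A lookup of this kind sits on the critical path of every prefill request, which is why the latency of metadata management, whether served by Redis or by RDMA-based systems such as CHIME [1] and Sherman [2], matters for end-to-end inference performance.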
May 29, 2025