Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference
Yue Zhu, Hao Yu, Chen Wang, Zhuoran Liu, Eun Kyung Lee
arXiv.org Artificial Intelligence
The increasing adoption of large language models (LLMs) with extended context windows necessitates efficient Key-Value Cache (KVC) management to optimize inference performance. We analyze real-world KVC access patterns using publicly available traces and evaluate commercial key-value stores like Redis and state-of-the-art RDMA-based systems (CHIME [1] and Sherman [2]) for KVC metadata management. Our work demonstrates the lack of a tailored storage solution for KVC prefilling, underscores the need for an efficient distributed caching system with optimized metadata management for LLM workloads, and provides insights into designing improved KVC management systems for scalable, low-latency inference.

Large Language Models (LLMs) have shown remarkable ability in tasks such as text generation, translation, and question answering, but their attention architecture introduces significant challenges. The use of key-value caches (KVC) in the attention layers of transformer models, while essential for efficient token generation, requires substantial memory resources.
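To make the role of the KVC concrete, here is a minimal sketch of single-head attention decoding with a KV cache, in PyTorch-style Python. All names and shapes are illustrative assumptions, not taken from the paper: the point is only that each step projects and caches the new token's key and value rather than recomputing them for the whole prefix.

```python
import torch

def attend_with_kv_cache(x_new, W_q, W_k, W_v, cache):
    """One decoding step of single-head attention with a KV cache.

    x_new : (batch, 1, d_model) embedding of the newly generated token.
    cache : dict with 'k' and 'v' tensors of shape (batch, t, d_head),
            holding projections of all t previously processed tokens
            (initialize with zero-length tensors, e.g. torch.zeros(batch, 0, d_head)).
    """
    q = x_new @ W_q                       # (batch, 1, d_head)
    k_new = x_new @ W_k                   # project only the new token
    v_new = x_new @ W_v

    # Append to the cache instead of recomputing K/V for the entire prefix.
    cache['k'] = torch.cat([cache['k'], k_new], dim=1)  # (batch, t+1, d_head)
    cache['v'] = torch.cat([cache['v'], v_new], dim=1)

    d_head = q.shape[-1]
    scores = q @ cache['k'].transpose(-2, -1) / d_head ** 0.5  # (batch, 1, t+1)
    weights = torch.softmax(scores, dim=-1)
    return weights @ cache['v'], cache    # attended output: (batch, 1, d_head)
```

Because the cached tensors grow linearly with context length, per layer and per head, long-context serving quickly exhausts GPU memory; this is the pressure that motivates offloading KVC blocks to external storage in the first place.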
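The paper evaluates Redis for KVC metadata management. As a rough illustration of what such a lookup involves, the sketch below assumes a hypothetical scheme that keys cached prefix blocks by a hash of their token-ID prefix; the key format, field names, and helper functions are invented for illustration and are not the schema used in the evaluation.

```python
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)

def prefix_key(token_ids):
    # Hypothetical key scheme: hash the token-ID prefix to name its KVC block.
    digest = hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()
    return f"kvc:prefix:{digest}"

def lookup_prefix(token_ids):
    # Return metadata for a previously prefilled prefix, or None on a miss.
    meta = r.hgetall(prefix_key(token_ids))
    return meta or None

def register_prefix(token_ids, node, offset, length):
    # Record where a newly prefilled KV block lives (illustrative fields).
    r.hset(prefix_key(token_ids), mapping={
        "node": node, "offset": offset, "length": length,
    })
```

A lookup of this kind sits on the critical path of every prefill request, which is why the latency of metadata management, whether served by Redis or by RDMA-based systems such as CHIME [1] and Sherman [2], matters for end-to-end inference performance.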
May 29, 2025