PiKV: KV Cache Management System for Mixture of Experts

Liu, Dong; Yu, Yanxuan; Lengerich, Ben; Wu, Ying Nian; Wang, Xuhong

Aug-12-2025–arXiv.org Artificial Intelligence

As large language models continue to scale up in both size and context length, the memory and communication cost of key-value (KV) cache storage has become a major bottleneck in multi-GPU and multi-node inference. While MoE-based architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead. We introduce PiKV, a parallel and distributed KV cache serving framework tailored for MoE architectures. PiKV leverages expert-sharded KV storage to partition caches across GPUs, PiKV routing to reduce token-to-KV access, and PiKV scheduling to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates PiKV compression modules into the caching pipeline for acceleration. PiKV is publicly available as an open-source software library: https://github.com/NoakLiu/PiKV. Experimental details are recorded at: https://github.com/NoakLiu/PiKV/blob/main/downstream_tasks/README.md. We have also integrated PiKV with NVIDIA kvpress for acceleration; for details, see https://github.com/NoakLiu/PiKVpress. PiKV is still a living project, aiming to become a comprehensive KV cache management system for MoE architectures.
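The interplay of expert-sharded storage, routing, and retention described above can be illustrated with a small toy. The following is a minimal sketch under stated assumptions, not the PiKV library's API: the class names `ExpertShard` and `ShardedKVCache`, the top-k routing rule, and the use of the router score as a retention score are all hypothetical choices made for illustration. Each shard stands in for the KV cache held on one GPU.

```python
from dataclasses import dataclass, field


@dataclass
class ExpertShard:
    """KV entries owned by one expert (hypothetical stand-in for one GPU's cache)."""
    capacity: int
    entries: dict = field(default_factory=dict)  # token_id -> (key, value, score)

    def put(self, token_id, key, value, score):
        self.entries[token_id] = (key, value, score)
        if len(self.entries) > self.capacity:
            # Assumed retention policy: evict the entry with the lowest score.
            victim = min(self.entries, key=lambda t: self.entries[t][2])
            del self.entries[victim]


class ShardedKVCache:
    """Toy cache that stores each token's KV only on the experts it is routed to."""

    def __init__(self, num_experts, capacity_per_expert, top_k=1):
        self.shards = [ExpertShard(capacity_per_expert) for _ in range(num_experts)]
        self.top_k = top_k

    def route(self, router_scores):
        """Return the indices of the top_k experts by router score."""
        order = sorted(range(len(router_scores)),
                       key=lambda e: router_scores[e], reverse=True)
        return order[: self.top_k]

    def write(self, token_id, key, value, router_scores):
        """Store the KV pair only in the routed shards, not in every shard."""
        experts = self.route(router_scores)
        for e in experts:
            self.shards[e].put(token_id, key, value, router_scores[e])
        return experts

    def read(self, router_scores):
        """Gather KV entries only from the shards the query is routed to."""
        out = {}
        for e in self.route(router_scores):
            for token_id, (key, value, _) in self.shards[e].entries.items():
                out[token_id] = (key, value)
        return out
```

In this sketch a query touches only `top_k` shards instead of a globally synchronized cache, and each shard's footprint is bounded by `capacity_per_expert`. A real system would hold tensors rather than Python objects and would add compression and cross-device communication, which are omitted here.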

large language model, machine learning, natural language

Country:
- North America > United States
  - Wisconsin > Dane County
    - Madison (0.14)
  - New York > New York County
    - New York City (0.04)
  - New Mexico > Bernalillo County
    - Albuquerque (0.04)
  - Connecticut > New Haven County
    - New Haven (0.04)
  - California > Los Angeles County
    - Los Angeles (0.28)
- Asia > China
  - Shanghai > Shanghai (0.04)

Genre:
- Research Report (0.82)

Industry:
- Energy (0.47)
- Information Technology (0.34)

Technology:
- Information Technology
  - Hardware (1.00)
  - Artificial Intelligence
    - Machine Learning (1.00)
    - Representation & Reasoning > Optimization (0.68)
    - Natural Language > Large Language Model (0.67)
