Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference
Chu, Kexin, Lin, Zecheng, Xiang, Dawei, Shen, Zixu, Su, Jianchang, Chu, Cheng, Yang, Yiwei, Zhang, Wenhui, Wu, Wenfei, Zhang, Wei
arXiv.org Artificial Intelligence
Global KV-cache sharing has emerged as a key optimization for accelerating large language model (LLM) inference. However, it exposes a new class of timing side-channel attacks, enabling adversaries to infer sensitive user inputs via shared cache entries. Existing defenses, such as per-user isolation, eliminate leakage but degrade performance by up to 38.9% in time-to-first-token (TTFT), making them impractical for high-throughput deployment. To address this gap, we introduce SafeKV (Secure and Flexible KV Cache Sharing), a privacy-aware KV-cache management framework that selectively shares non-sensitive entries while confining sensitive content to private caches. SafeKV comprises three components: (i) a hybrid, multi-tier detection pipeline that integrates rule-based pattern matching, a general-purpose privacy detector, and context-aware validation; (ii) a unified radix-tree index that manages public and private entries across heterogeneous memory tiers (HBM, DRAM, SSD); and (iii) entropy-based access monitoring to detect and mitigate residual information leakage. Our evaluation shows that SafeKV mitigates 94%-97% of timing-based side-channel attacks. Compared to per-user isolation, SafeKV improves TTFT by up to 40.58% and throughput by up to 2.66× across diverse LLMs and workloads. By combining fine-grained privacy control with high cache-reuse efficiency, SafeKV reclaims the performance advantages of global sharing while providing robust runtime privacy guarantees for LLM inference.

Large language models (LLMs) now underpin applications from dialogue to complex reasoning. To meet time-sensitive inference demands, key-value (KV) caching stores intermediate attention states ("keys" and "values") to eliminate redundant computation for sequential or similar prompts, thereby accelerating generation [70]. This efficiency gain is amplified through KV-cache sharing across multiple requests.
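The selective-sharing idea described above can be sketched as a prefix tree whose entries carry an optional owner tag: entries flagged sensitive are reusable only by their owner, while the rest remain globally shareable. The following is a minimal illustrative sketch, not the authors' implementation; the class and method names (`SelectiveKVCache`, `longest_reusable_prefix`) and the string stand-in for real attention-state tensors are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict = field(default_factory=dict)  # token -> child Node
    kv: object = None            # cached attention state for this prefix (stub)
    private_owner: str = None    # None => shareable; else the owning user id

class SelectiveKVCache:
    """Illustrative prefix-tree KV cache: shareable entries are reused
    across users, sensitive entries are visible only to their owner."""

    def __init__(self):
        self.root = Node()

    def insert(self, tokens, user, sensitive=False):
        """Cache the attention state for `tokens`, tagging it private if sensitive."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
        node.kv = f"kv({tokens})"  # stand-in for the real KV tensors
        node.private_owner = user if sensitive else None

    def longest_reusable_prefix(self, tokens, user):
        """Length of the longest cached prefix this user is allowed to reuse."""
        node, best = self.root, 0
        for i, t in enumerate(tokens):
            node = node.children.get(t)
            if node is None:
                break
            if node.kv is not None and node.private_owner in (None, user):
                best = i + 1
        return best
```

For example, if "alice" inserts the prefix `[1, 2, 3]` as sensitive while `[1, 2]` is cached as public, a request from "bob" for `[1, 2, 3, 4]` may reuse only the two public tokens, whereas "alice" may reuse all three.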
In particular, prompts with common prefixes, such as shared dialogue history or structured prompting patterns, enable substantial throughput improvements and latency reductions. Consequently, KV-cache sharing has become a critical mechanism for boosting throughput and reducing response latency in large-scale, multi-user LLM deployments. Empirical studies confirm that a substantial portion of real-world prompts exhibit prefix-level or structural overlap [42], [74], making shared KV reuse both practical and highly beneficial. Despite these performance benefits, KV-cache sharing raises serious privacy and security concerns in shared or multi-tenant deployments. Specifically, KV-cache sharing across mutually untrusted users can lead to unintended information leakage.
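The leakage channel is the timing asymmetry itself: a prompt whose prefix is already cached skips prefill work and returns its first token sooner, so an attacker probing a shared cache can learn whether another user's prompt seeded it. A toy arithmetic model (not a real attack; `simulate_ttft` and the per-token cost are assumptions for illustration) makes the signal concrete:

```python
def simulate_ttft(prompt, cached_prefixes, per_token_prefill_ms=5):
    """Hypothetical TTFT model: tokens covered by a cached prefix skip prefill."""
    hit = max((len(p) for p in cached_prefixes if prompt[:len(p)] == p), default=0)
    return (len(prompt) - hit) * per_token_prefill_ms

# Another user's prompt has seeded the shared cache with this prefix.
cache = [("confidential", "report", "q3")]

# The attacker probes with a guessed prefix and with an unrelated prompt.
probe_hit = simulate_ttft(("confidential", "report", "q3", "summary"), cache)
probe_miss = simulate_ttft(("weather", "forecast", "today", "please"), cache)

assert probe_hit < probe_miss  # the faster TTFT reveals the cached prefix
```

The observable TTFT gap (here 5 ms vs. 20 ms under the toy cost model) is exactly what per-user isolation removes at high performance cost, and what selective sharing aims to suppress while keeping non-sensitive reuse.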
Aug-13-2025