Tail-Optimized Caching for LLM Inference

Jun-14-2026, 07:12:54 GMT, Neural Information Processing Systems

Prompt caching is critical for reducing latency and cost in LLM inference: OpenAI and Anthropic report cost savings of up to 50-90% through prompt reuse. Despite its widespread success, little is known about what constitutes an optimal prompt caching policy, particularly when optimizing tail latency, a metric of central importance to practitioners. The widely used Least Recently Used (LRU) policy can perform arbitrarily poorly on this metric because it is oblivious to the heterogeneity of conversation lengths. To address this gap, we propose Tail-Optimized LRU, a simple two-line modification of LRU that reallocates KV cache capacity to prioritize high-latency conversations by evicting cache entries that are unlikely to affect future turns. Though the implementation is simple, we prove its optimality under a natural stochastic model of conversation dynamics, providing the first theoretical justification for LRU in this setting, a result that may be of independent interest to the caching community.
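The abstract does not spell out the two-line modification, so the following is only an illustrative sketch of the general idea and not the authors' algorithm. It assumes a toy model in which a conversation's next-turn latency is proportional to its uncached token count, and it first evicts tokens whose loss would still keep that count within a latency budget before falling back to plain LRU. The names `threshold`, `expected_new`, `history_len`, and both function names are hypothetical.

```python
from collections import OrderedDict


def evict_lru(cache, need):
    """Plain LRU: free `need` tokens, oldest conversation first.

    `cache` maps conversation id -> cached token count, ordered from
    least to most recently used.
    """
    freed = 0
    for conv_id in list(cache):
        if freed >= need:
            break
        take = min(cache[conv_id], need - freed)
        cache[conv_id] -= take
        freed += take
        if cache[conv_id] == 0:
            del cache[conv_id]
    return freed


def evict_tail_aware(cache, history_len, expected_new, threshold, need):
    """Toy tail-aware variant (an assumption, not the paper's policy).

    Pass 1 trims only "safe" tokens: those whose eviction keeps the
    conversation's next-turn uncached length at or below `threshold`.
    Pass 2 falls back to plain LRU if more space is still needed.
    """
    freed = 0
    for conv_id in list(cache):
        if freed >= need:
            break
        # Tokens to recompute next turn if this entry is left untouched.
        uncached = history_len[conv_id] - cache[conv_id] + expected_new
        slack = max(0, threshold - uncached)
        take = min(slack, cache[conv_id], need - freed)
        cache[conv_id] -= take
        freed += take
        if cache[conv_id] == 0:
            del cache[conv_id]
    if freed < need:
        freed += evict_lru(cache, need - freed)
    return freed


def worst_uncached(cache, history_len, expected_new):
    """Largest next-turn uncached length across conversations (tail proxy)."""
    return max(
        history_len[c] - cache.get(c, 0) + expected_new for c in history_len
    )


history = {"long": 900, "short": 100}

lru_cache = OrderedDict([("long", 900), ("short", 100)])
evict_lru(lru_cache, need=200)

tail_cache = OrderedDict([("long", 900), ("short", 100)])
evict_tail_aware(tail_cache, history, expected_new=50, threshold=200, need=200)

print(worst_uncached(lru_cache, history, 50))
print(worst_uncached(tail_cache, history, 50))
```

In this toy setup both policies free the same 200 tokens, but plain LRU takes all of them from the long, least recently used conversation, while the tail-aware variant spreads the eviction so that no conversation exceeds the assumed budget on its next turn.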

artificial intelligence, large language model, natural language, (7 more...)
