ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction
–Neural Information Processing Systems
Large Language Models (LLMs) are widely used in today's tasks of natural language processing. To support applications like multi-turn chats, document understanding, and content generation, models with long context lengths are growing in importance.However, managing long contexts brings substantial challenges due to the expansion of key-value cache (KV cache). Longer KV cache requires larger memory, limiting the batch-size thus decreasing throughput. Also, computing attention over long KV cache incurs more memory access, hurting the end-to-end latency.Prior works find that it is sufficient to use only the recent and high-impact tokens for attention computation, allowing the eviction of less vital tokens to shrink cache size.Nonetheless, we observe a dynamic shift in token importance across different decoding steps. Tokens initially evicted might regain importance after certain decoding steps.To address this, we propose ArkVale, a page-based KV cache manager that can recognize and recall currently important tokens evicted before.
Neural Information Processing Systems
May-27-2025, 16:57:34 GMT
- Technology: