Efficient Generative LLM Inference with Recallable Key-Value Eviction
Neural Information Processing Systems
Large Language Models (LLMs) are widely used across natural language processing tasks today. To support applications like multi-turn chat, document understanding, and content generation, models with long context lengths are growing in importance. However, managing long contexts brings substantial challenges due to the expansion of the key-value (KV) cache. A longer KV cache requires more memory, which limits the batch size and thus decreases throughput.
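The memory pressure described above is easy to quantify. The following back-of-the-envelope sketch estimates KV cache size per sequence for a hypothetical model configuration (the layer count, head count, head dimension, and fp16 storage are illustrative assumptions, not taken from the paper):

```python
def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Estimate KV cache size in bytes for one sequence.

    The factor of 2 accounts for storing both keys and values
    at every layer for every token. dtype_bytes=2 assumes fp16.
    """
    return 2 * num_layers * num_heads * head_dim * seq_len * dtype_bytes


# Hypothetical 32-layer model, 32 heads of dimension 128, fp16,
# at a 32k-token context length:
size = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128,
                      seq_len=32_768)
print(f"{size / 2**30:.1f} GiB per sequence")  # → 16.0 GiB per sequence
```

At these (assumed) dimensions a single 32k-token sequence already consumes 16 GiB of KV cache, so only a handful of sequences fit on one accelerator, which is exactly the batch-size limitation the abstract points to.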