Efficient Generative LLM Inference with Recallable Key-Value Eviction

Neural Information Processing Systems 

Large Language Models (LLMs) are widely used in today's natural language processing tasks. To support applications such as multi-turn chat, document understanding, and content generation, models with long context lengths are becoming increasingly important. However, managing long contexts brings substantial challenges due to the expansion of the key-value (KV) cache. A longer KV cache requires more memory, which limits the batch size and thus decreases throughput.
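As a rough back-of-envelope illustration (not taken from this paper, and with hypothetical model parameters), the KV cache size per request can be estimated as

$$
M_{\mathrm{KV}} \;=\; 2 \cdot n_{\mathrm{layers}} \cdot n_{\mathrm{kv\text{-}heads}} \cdot d_{\mathrm{head}} \cdot s \cdot p \ \text{bytes},
$$

where the factor of 2 accounts for keys and values, $s$ is the sequence length, and $p$ is the bytes per element. Assuming, for example, 32 layers, 8 KV heads, head dimension 128, FP16 storage ($p = 2$), and a 32K-token context, a single request already occupies roughly $2 \cdot 32 \cdot 8 \cdot 128 \cdot 32768 \cdot 2 \approx 4$ GB, so a modest batch of 8 such requests consumes about 34 GB of accelerator memory for the KV cache alone.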