SnapKV: LLM Knows What You Are Looking for Before Generation
Bowen Yang
–Neural Information Processing Systems
Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache with increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative, fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable accuracy in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Moreover, this robust pattern can be obtained from an 'observation' window located at the end of the prompt.