Accurate KV Cache Eviction via Anchor Direction Projection for Efficient LLM Inference

Neural Information Processing Systems 

Key-Value (KV) cache eviction, which retains the KV pairs of the most important tokens while discarding less important ones, is a critical technique for optimizing both memory usage and inference latency in large language models (LLMs). However, existing approaches often rely on simple heuristics, such as attention weights, to measure token importance, overlooking the spatial relationships between token value states in the vector space. This often leads to suboptimal token selection and thus performance degradation. To tackle this problem, we propose a novel method, AnDPro (Anchor Direction Projection), which introduces a projection-based scoring function to measure token importance more accurately. Specifically, AnDPro operates in the space of value vectors and leverages the projections of these vectors onto an "Anchor Direction" (the direction of the pre-eviction output) to measure token importance and guide more accurate token selection. Experiments on 16 datasets from the LongBench benchmark demonstrate that AnDPro can maintain 96.07% of the full-cache accuracy using only a 3.44% KV cache budget, reducing the KV cache budget size by 46.0% without compromising quality compared to previous state-of-the-art methods.
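
The projection idea described in the abstract can be illustrated with a minimal sketch for a single attention head. This is not the paper's scoring function, whose exact form the abstract does not give. It assumes that the pre-eviction output is the attention-weighted sum of value vectors and that a token's score is its attention weight times the projection of its value vector onto the normalized output direction. The names `anchor_projection_scores` and `select_tokens` are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def anchor_projection_scores(attn_weights, values):
    """Score tokens by projecting weighted value vectors onto the anchor direction.

    Assumption (not from the paper): score_i = a_i * <v_i, o / ||o||>,
    where o = sum_i a_i * v_i is the pre-eviction attention output.
    """
    dim = len(values[0])
    # Pre-eviction output: attention-weighted sum of value vectors.
    output = [sum(w * v[j] for w, v in zip(attn_weights, values)) for j in range(dim)]
    norm = math.sqrt(dot(output, output))
    if norm == 0.0:
        # Degenerate case: no direction to project onto.
        return [0.0] * len(values)
    anchor = [x / norm for x in output]  # unit-length anchor direction
    return [w * dot(v, anchor) for w, v in zip(attn_weights, values)]


def select_tokens(scores, budget):
    """Return the indices of the `budget` highest-scoring tokens, in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example: token 1 has the second-largest attention weight,
# but its value vector points against the output direction.
attn_weights = [0.5, 0.3, 0.2]
values = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]

scores = anchor_projection_scores(attn_weights, values)
kept = select_tokens(scores, budget=2)
```

In this toy example, ranking by attention weight alone would keep tokens 0 and 1, while the projection-based score keeps tokens 0 and 2, because token 1 contributes negatively along the output direction. Under the stated assumptions the scores sum to the norm of the pre-eviction output, so each score can be read as a token's signed contribution to that output.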
