On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference
arXiv.org Artificial Intelligence
Despite the recent success of Large Language Models (LLMs), they are cost-prohibitive to deploy in resource-constrained environments due to their excessive memory and computational demands. In addition to model parameters, the key-value cache is also stored in GPU memory, growing linearly with batch size and sequence length. As a remedy, recent works have proposed various eviction policies for keeping the overhead of the key-value cache under a given budget. This paper examines the efficacy of existing eviction policies in terms of importance score calculation and eviction scope construction. We identify the deficiency of prior policies in these two aspects and introduce RoCo, a Robust Cache Omission policy based on temporal attention scores and robustness measures. Extensive experimentation spanning prefilling and auto-regressive decoding stages validates the superiority of RoCo. Finally, we release EasyKV, a versatile software package dedicated to user-friendly key-value constrained generative inference. Code is available at https://github.com/DRSY/EasyKV.
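To make the abstract's two technical points concrete, here is a minimal Python sketch, not taken from the paper or from EasyKV. The first function shows why the key-value cache grows linearly with batch size and sequence length. The second shows a generic budgeted eviction step that keeps the highest-scoring cached tokens while protecting a window of recent ones. The function names, the model dimensions in the usage lines, and the top-k-plus-recent-window rule are all illustrative assumptions. This is not RoCo's scoring or eviction scope, which the abstract does not specify.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Approximate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 accounts for storing both keys and values.
    The result is linear in batch_size and in seq_len.
    """
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_elem)


def evict_to_budget(scores, budget, num_recent=0):
    """Return the sorted indices of cache entries to keep (generic sketch).

    scores     -- one importance score per cached token (higher = keep)
    budget     -- maximum number of entries allowed to remain
    num_recent -- most recent entries that are never evicted (assumed rule)
    """
    n = len(scores)
    if n <= budget:
        return list(range(n))
    num_recent = min(num_recent, budget)
    protected = list(range(n - num_recent, n))
    # Only older tokens are candidates for eviction.
    candidates = range(n - num_recent)
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    kept_old = ranked[: budget - num_recent]
    return sorted(kept_old + protected)


# Hypothetical configuration: 32 layers, 32 KV heads, head_dim 128, fp16.
one = kv_cache_bytes(1, 4096, 32, 32, 128)
two = kv_cache_bytes(2, 4096, 32, 32, 128)
print(one / 2**30, two / 2**30)

# Keep 3 of 5 cached tokens, always retaining the most recent one.
print(evict_to_budget([0.9, 0.1, 0.5, 0.2, 0.3], budget=3, num_recent=1))
```

In this sketch the importance score and the set of eviction candidates are supplied separately, which mirrors the two aspects the abstract names: importance score calculation and eviction scope construction.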
Feb-9-2024
- Country:
  - Asia
  - Europe > Spain
    - Catalonia > Barcelona Province > Barcelona (0.04)
  - North America
    - Canada
      - British Columbia > Metro Vancouver Regional District
        - Vancouver (0.04)
      - Ontario > Toronto (0.04)
    - United States
      - Michigan > Washtenaw County
        - Ann Arbor (0.04)
      - Oklahoma (0.04)
      - Pennsylvania (0.04)
      - Texas (0.04)
  - South America > Chile
- Genre:
  - Research Report (0.82)
- Industry:
  - Leisure & Entertainment (1.00)
  - Media > Television (0.46)
- Technology: