On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference

Feb-9-2024–arXiv.org Artificial Intelligence

Despite the recent success associated with Large Language Models~(LLMs), they are notably cost-prohibitive to deploy in resource-constrained environments due to their excessive memory and computational demands. In addition to model parameters, the key-value cache is also stored in GPU memory, growing linearly with batch size and sequence length. As a remedy, recent works have proposed various eviction policies for maintaining the overhead of key-value cache under a given budget. This paper embarks on the efficacy of existing eviction policies in terms of \textit{importance score calculation} and \textit{eviction scope construction}. We identify the deficiency of prior policies in these two aspects and introduce RoCo, a \underline{r}\underline{o}bust \underline{c}ache \underline{o}mission policy based on temporal attention scores and robustness measures. Extensive experimentation spanning prefilling and auto-regressive decoding stages validates the superiority of RoCo. Finally, we release EasyKV, a versatile software package dedicated to user-friendly key-value constrained generative inference. Code available at \url{https://github.com/DRSY/EasyKV}.

arxiv preprint arxiv, broadway, eviction policy, (13 more...)

arXiv.org Artificial Intelligence

Feb-9-2024

arXiv.org PDF

Add feedback

Country:
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America
  - United States
    - Oklahoma (0.04)
    - Texas (0.04)
    - Pennsylvania (0.04)
    - Michigan > Washtenaw County
      - Ann Arbor (0.04)
  - Canada
    - Ontario > Toronto (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.04)
- Europe > Spain
  - Catalonia > Barcelona Province > Barcelona (0.04)
- Asia
  - Singapore (0.04)
  - Middle East > UAE (0.04)
  - China > Shanghai
    - Shanghai (0.04)

Genre:
- Research Report (0.82)

Industry:
- Leisure & Entertainment (1.00)
- Media > Television (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.96)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)