H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Neural Information Processing Systems
Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H2).
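The heavy-hitter idea can be sketched as a cache eviction policy: track each cached token's accumulated attention score, and when the cache exceeds a budget, keep the most recent tokens plus the older tokens with the highest accumulated scores. This is a minimal illustrative sketch, not the paper's implementation; all function and parameter names (`h2o_evict`, `budget`, `recent`) are assumptions for illustration.

```python
import numpy as np

def h2o_evict(keys, values, acc_scores, budget, recent):
    """Illustrative heavy-hitter eviction sketch (not the authors' code).

    keys, values:  arrays of shape (n, d) holding the KV cache.
    acc_scores:    shape (n,), accumulated attention score per cached token.
    budget:        total number of tokens to retain.
    recent:        number of most-recent tokens always retained.
    """
    n = len(acc_scores)
    if n <= budget:
        return keys, values, acc_scores
    recent_idx = set(range(n - recent, n))
    # Rank the older tokens by accumulated attention ("heavy hitters" first).
    older = sorted(range(n - recent), key=lambda i: acc_scores[i], reverse=True)
    keep = sorted(recent_idx | set(older[: budget - recent]))
    return keys[keep], values[keep], acc_scores[keep]
```

Because eviction keeps both a recency window and the highest-scoring older tokens, the retained cache stays within `budget` entries regardless of sequence length.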