H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Neural Information Processing Systems
Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H2).
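The heavy-hitter idea can be sketched as a cache eviction policy: track each cached token's accumulated attention score, and when the cache exceeds a budget, keep the most recent tokens plus the older tokens with the highest accumulated scores. This is a minimal illustrative sketch, not the paper's implementation; all function and parameter names (`h2o_evict`, `budget`, `recent`) are assumptions for illustration.

```python
import numpy as np

def h2o_evict(keys, values, acc_scores, budget, recent):
    """Illustrative heavy-hitter eviction sketch (not the authors' code).

    keys, values:  arrays of shape (n, d) holding the KV cache.
    acc_scores:    shape (n,), accumulated attention score per cached token.
    budget:        total number of tokens to retain.
    recent:        number of most-recent tokens always retained.
    """
    n = len(acc_scores)
    if n <= budget:
        return keys, values, acc_scores
    recent_idx = set(range(n - recent, n))
    # Rank the older tokens by accumulated attention ("heavy hitters" first).
    older = sorted(range(n - recent), key=lambda i: acc_scores[i], reverse=True)
    keep = sorted(recent_idx | set(older[: budget - recent]))
    return keys[keep], values[keep], acc_scores[keep]
```

Because eviction keeps both a recency window and the highest-scoring older tokens, the retained cache stays within `budget` entries regardless of sequence length.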