You Only Cache Once: Decoder-Decoder Architectures for Language Models

Neural Information Processing Systems

However, as the number of serving tokens increases, the key-value (KV) caches consume substantial GPU memory, rendering the inference of large language models memory-bound [29].
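To see why the KV cache dominates at long sequence lengths, the following back-of-the-envelope calculation estimates its footprint for a standard decoder-only Transformer; the configuration numbers (a 32-layer model with 32 KV heads of dimension 128, fp16 storage) are illustrative assumptions, not taken from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size for a decoder-only Transformer.

    The factor of 2 accounts for storing both keys and values at
    every layer; bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative config (assumed, not from the paper): 32 layers,
# 32 KV heads, head_dim 128, 4096-token context, batch size 1.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, batch_size=1)
print(f"{size / 2**30:.1f} GiB")  # 2.0 GiB for a single 4K-token sequence
```

Because the cost grows linearly in both sequence length and batch size, serving many long-context requests quickly exhausts GPU memory, which is the bottleneck the decoder-decoder design targets.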