You Only Cache Once: Decoder-Decoder Architectures for Language Models
–Neural Information Processing Systems
However, as the number of tokens being served increases, the key-value (KV) caches consume a large share of GPU memory, making the inference of large language models memory-bound [29].
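To make the memory pressure concrete, the following is a minimal sketch of the usual back-of-the-envelope estimate for a standard Transformer decoder, where every layer caches one key tensor and one value tensor per token. The function name `kv_cache_bytes` and the model configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate the KV cache size in bytes for a standard Transformer decoder.

    Assumes every layer stores its own keys and values for every token.
    """
    # Factor 2: one tensor for keys and one for values in each layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

for seq_len in (4_096, 65_536, 1_048_576):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len) / 2**30
    print(f"{seq_len:>9} tokens -> {gib:,.1f} GiB")
```

Under these assumptions the cache costs 0.5 MiB per token, so it grows linearly from 2 GiB at 4,096 tokens to 512 GiB at roughly one million tokens for a single sequence, which is well beyond the memory of a single GPU.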
Feb-7-2026, 21:23:51 GMT
- Country:
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Genre:
- Research Report (0.68)
- Technology: