Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries

Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos


We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that the key-value cache in modern LLMs can be accurately approximated by sparse linear combinations of atoms from a small, input-agnostic dictionary of 4k atoms, enabling efficient compression across different input prompts, tasks, and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct control of the sparsity level. Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV cache memory, outperforming both quantization and token-eviction methods. Notably, Lexico remains effective in low-memory regimes where 2-bit quantization fails, achieving up to 1.7× better compression on LongBench and GSM8K while maintaining high accuracy.

Figure 1: Memory usage vs. performance of Lexico compared to other key-value (KV) cache compression methods on GSM8K. The figure illustrates the relationship between KV cache size and performance for Llama models under 5-shot GSM8K evaluation. Lexico consistently outperforms both eviction-based methods (SnapKV, PyramidKV) and quantization-based methods (per-token quantization, KIVI, ZipCache).

Transformers (Vaswani et al., 2017) have become the backbone of frontier Large Language Models (LLMs), driving progress in domains beyond natural language processing. However, Transformers are typically limited by their significant memory requirements. This stems not only from the large number of model parameters, but also from having to maintain the KV cache, which grows in proportion to the model size (i.e., the number of layers, attention heads, and the embedding dimension) and the token length of the input.
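To make this growth concrete, the back-of-the-envelope calculation below computes the dense KV cache footprint for one illustrative configuration (Llama-2-7B-style shapes in fp16; the numbers are for illustration and are not results from the paper).

```python
# Dense KV cache footprint: 2 tensors (K and V) per layer, each of shape
# (num_kv_heads, seq_len, head_dim). Shapes below match Llama-2-7B in fp16.
layers, kv_heads, head_dim = 32, 32, 128
seq_len, bytes_per_elem = 4096, 2   # 4k-token context, 2 bytes for fp16
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
print(kv_bytes / 2**30)             # -> 2.0 GiB for a single sequence
```

The cache also scales linearly with batch size, so serving many long-context requests quickly makes it the dominant memory cost.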
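For intuition on the sparse-coding step named in the abstract, the sketch below shows a generic orthogonal matching pursuit (OMP) routine that approximates a single key or value vector as a sparse combination of dictionary atoms. This is a minimal sketch, not the paper's implementation: the dictionary `D` here is random for illustration, whereas Lexico uses a learned, input-agnostic universal dictionary.

```python
import numpy as np

def omp(D, y, sparsity):
    """Orthogonal matching pursuit: approximate y as a sparse linear
    combination of the columns (atoms) of dictionary D.

    D: (d, n) dictionary with unit-norm columns.
    y: (d,) target vector, e.g., one key or value vector.
    sparsity: number of nonzero coefficients to select.
    Returns (support, coeffs) such that D[:, support] @ coeffs ~= y.
    """
    residual = y.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(sparsity):
        # Greedily pick the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Re-fit all selected coefficients by least squares on the support.
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    return np.array(support), coeffs

# Toy usage: compress a stand-in KV vector against a random dictionary.
d, n, s = 128, 4096, 8               # head dim, dictionary size, sparsity
rng = np.random.default_rng(0)
D = rng.standard_normal((d, n))
D /= np.linalg.norm(D, axis=0)       # unit-norm atoms
y = rng.standard_normal(d)           # stand-in for one key/value vector
support, coeffs = omp(D, y, s)
y_hat = D[:, support] @ coeffs       # reconstruction from only s atoms
```

Storing just the `s` (index, coefficient) pairs per vector, rather than the dense vector itself, is what yields the compression: with a 4,096-atom dictionary each index fits in 12 bits, and the memory footprint is tuned directly by choosing the sparsity `s`.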