Limits of KV Cache Compression for Tensor Attention based Autoregressive Transformers
Chen, Yifang, Li, Xiaoyu, Liang, Yingyu, Shi, Zhenmei, Song, Zhao, Tian, Yu
–arXiv.org Artificial Intelligence
The key-value (KV) cache in autoregressive transformers presents a significant bottleneck during inference, which restricts the context length capabilities of large language models (LLMs). While previous work analyzes the fundamental space complexity barriers in standard attention mechanism [Haris and Onak, 2025], our work generalizes the space complexity barriers result to tensor attention version. Our theoretical contributions rely on a novel reduction from communication complexity and deduce the memory lower bound for tensor-structured attention mechanisms when $d = \Omega(\log n)$. In the low dimensional regime where $d = o(\log n)$, we analyze the theoretical bounds of the space complexity as well. Overall, our work provides a theoretical foundation for us to understand the compression-expressivity tradeoff in tensor attention mechanisms and offers more perspectives in developing more memory-efficient transformer architectures.
arXiv.org Artificial Intelligence
Mar-14-2025
- Country:
- South America > Chile
- North America > United States
- Wisconsin > Dane County
- Madison (0.04)
- Illinois > Cook County
- Chicago (0.04)
- Georgia > Fulton County
- Atlanta (0.04)
- Wisconsin > Dane County
- Europe
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Spain > Galicia
- Madrid (0.04)
- United Kingdom > England
- Asia
- Middle East > Jordan (0.04)
- China > Hong Kong (0.04)
- Africa > Senegal
- Kolda Region > Kolda (0.04)
- Genre:
- Research Report (1.00)
- Industry:
- Energy (0.92)
- Technology: