SALS: Sparse Attention in Latent Space for KV Cache Compression

Jun-9-2026, 14:35:35 GMT–Neural Information Processing Systems

Large Language Models (LLMs) capable of handling extended contexts are in high demand, yet their inference remains challenging due to substantial Key-Value (KV) cache size and high memory bandwidth requirements. Previous research has demonstrated that the KV cache exhibits low-rank characteristics within the hidden dimension, suggesting the potential for effective compression. However, because modern LLMs widely adopt the Rotary Position Embedding (RoPE) mechanism, naive low-rank compression either suffers severe accuracy degradation or creates a new speed bottleneck, since the low-rank cache must first be reconstructed before RoPE can be applied. In this paper, we introduce two key insights: first, applying RoPE to the key vectors increases their variance, which in turn results in a higher rank; second, after the key vectors are transformed into the latent space, they largely maintain their representation across most layers. Based on these insights, we propose the Sparse Attention in Latent Space (SALS) framework.
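The first insight can be illustrated with a toy NumPy sketch. This is not the paper's experiment or the SALS implementation: the key matrix is synthetic and exactly low-rank, and the helper `apply_rope`, the sizes, and the RoPE base of 10000 are assumptions chosen for illustration. It compares the numerical rank of the keys before and after a standard interleaved-pair rotary embedding.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Rotate each (even, odd) feature pair of row t by angle t * theta_i.

    x has shape (seq_len, dim) with dim even; theta_i = base ** (-2i / dim).
    Hypothetical helper written for this sketch only.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)       # per-pair frequencies
    angles = np.outer(np.arange(seq_len), freqs)    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Synthetic keys that are exactly rank-8 in a 64-dimensional head (assumed sizes).
rng = np.random.default_rng(0)
seq_len, dim, rank = 256, 64, 8
keys = rng.standard_normal((seq_len, rank)) @ rng.standard_normal((rank, dim))

rotated = apply_rope(keys)
rank_before = np.linalg.matrix_rank(keys)
rank_after = np.linalg.matrix_rank(rotated)
print("rank before RoPE:", rank_before)
print("rank after RoPE: ", rank_after)
```

Because every position is rotated by a different angle, the rotated rows no longer share one low-dimensional subspace, which is why compressing keys after RoPE is harder than compressing them before it. Real key caches are only approximately low-rank, so the effect in practice is a matter of degree rather than an exact rank jump.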

artificial intelligence, large language model, natural language

Genre:
- Research Report > New Finding (0.38)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)