PolarQuant: Leveraging Polar Transformation for Key Cache Quantization and Decoding Acceleration

Neural Information Processing Systems 

The increasing demand for long-context generation has made the KV cache in large language models a bottleneck in memory consumption. Quantizing the cache to lower bit widths is an effective way to reduce memory costs; however, previous methods struggle with key cache quantization due to outliers, resulting in suboptimal performance. We propose a novel quantization approach PolarQuant, which provides a new perspective for key cache quantization and efficiently addresses the outlier dilemma. We observe that the distribution of the key states reveals well-structured patterns under polar transformation. Outliers generally appear in only one of the two dimensions, which are rotated together by a specific angle when rotary position embeddings are applied. When represented as two-dimensional vectors, these dimensions exhibit well-organized patterns, with radii and angles smoothly distributed in polar space.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found