TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni

arXiv.org Artificial Intelligence (arXiv:2504.19874v1)

Vector quantization, a problem rooted in Shannon's source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence of distinct coordinates in high dimensions to apply an optimal scalar quantizer to each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, yielding an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on the best achievable distortion rate of any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant (≈ 2.7) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.

1 Introduction

Vector quantization (VQ) in Euclidean space is crucial for efficiently handling high-dimensional vectors across a spectrum of computational domains, from training and deploying large-scale AI and deep learning models to powering vector databases for search/retrieval systems.
The core objective is to compress high-dimensional vectors by quantizing them (converting floating-point coordinate values to low-bitwidth integers) while minimizing distortion, quantified by metrics such as mean-squared error (MSE). By preserving these properties, inner product queries can be answered rapidly, with minimal latency, and using reduced computational and communication resources. This problem's roots trace back to Shannon's seminal work on source coding theory [48, 49], which established that the least distortion achievable by block source codes, now known as vector quantizers, is defined by the Shannon distortion-rate function, determined by the statistical properties of the source and the chosen distortion measure, such as MSE. Today, VQ plays a critical role in fundamental computational domains, including AI, deep learning, and search systems. A key application of VQ is in the deployment of AI models, including large language models (LLMs) [5, 18, 7, 52].
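To make the pipeline described above concrete, here is a minimal, illustrative NumPy sketch of the rotate-then-scalar-quantize idea, plus a 1-bit code on the residual. Note the assumptions: a plain uniform scalar quantizer stands in for the paper's Beta-optimal per-coordinate quantizer, and taking residual signs is only a stand-in for the 1-bit QJL transform; all parameter names and choices (dimension, bit-width, scaling) are illustrative, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # illustrative dimension

# Random rotation: orthogonalize a Gaussian matrix via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize(x, bits=4):
    """Rotate x, then uniformly scalar-quantize each coordinate.

    Returns the dequantized rotated vector, a 1-bit sign code of the
    residual (a simplified stand-in for the QJL residual step), and the scale.
    """
    y = Q @ x
    scale = np.max(np.abs(y)) + 1e-12       # per-vector scaling (illustrative)
    levels = 2 ** (bits - 1)                # symmetric integer grid
    codes = np.clip(np.round(y / scale * levels), -levels, levels - 1)
    y_hat = codes / levels * scale          # dequantized values
    residual_signs = np.sign(y - y_hat)     # 1-bit residual code
    return y_hat, residual_signs, scale

x = rng.standard_normal(d)
y_hat, signs, scale = quantize(x, bits=4)
mse = np.mean((Q @ x - y_hat) ** 2)
print(f"per-coordinate MSE after 4-bit quantization: {mse:.6f}")
```

Because the rotation is orthogonal, distortion measured on the rotated vector equals distortion on the original, so the per-coordinate MSE printed here directly reflects the quality of the compressed representation.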
