matthij douze
Machine learning and high dimensional vector search
Most high-dimensional vector search methods are based on st atistical tools, signal processing approaches or graph traversal algorithms. Statistical tools include random projections [15], dimensionality reduction (PCA and the SVD). Signal processing is employed p rimarily to compress vectors with quantization [30, 4, 22] Most recent indexing methods are rely on graphs [34, 49, 3, 11] that are built with graph traversal heuristics. Vector search (VS) is used in machine learning (ML) for train ing data deduplication [39] and searching ML embeddings [28, 5]. Therefore, there are many r esearch teams around the world that are competent in both fields.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search
Severo, Daniel, Ottaviano, Giuseppe, Muckley, Matthew, Ullrich, Karen, Douze, Matthijs
Approximate nearest neighbor search for vectors relies on indexes that are most often accessed from RAM. Therefore, storage is the factor limiting the size of the database that can be served from a machine. Lossy vector compression, i.e., embedding quantization, has been applied extensively to reduce the size of indexes. However, for inverted file and graph-based indices, auxiliary data such as vector ids and links (edges) can represent most of the storage cost. We introduce and evaluate lossless compression schemes for these cases. These approaches are based on asymmetric numeral systems or wavelet trees that exploit the fact that the ordering of ids is irrelevant within the data structures. In some settings, we are able to compress the vector ids by a factor 7, with no impact on accuracy or search runtime. On billion-scale datasets, this results in a reduction of 30% of the index size. Furthermore, we show that for some datasets, these methods can also compress the quantized vector codes losslessly, by exploiting sub-optimalities in the original quantization algorithm. The source code for our approach available at https://github.com/facebookresearch/vector_db_id_compression.
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Massachusetts (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (2 more...)
Residual Quantization with Implicit Neural Codebooks
Huijben, Iris, Douze, Matthijs, Muckley, Matthew, van Sloun, Ruud, Verbeek, Jakob
Vector quantization is a fundamental operation for data compression and vector search. To obtain high accuracy, multi-codebook methods increase the rate by representing each vector using codewords across multiple codebooks. Residual quantization (RQ) is one such method, which increases accuracy by iteratively quantizing the error of the previous step. The error distribution is dependent on previously selected codewords. This dependency is, however, not accounted for in conventional RQ as it uses a generic codebook per quantization step. In this paper, we propose QINCo, a neural RQ variant which predicts specialized codebooks per vector using a neural network that is conditioned on the approximation of the vector from previous steps. Experiments show that QINCo outperforms state-of-the-art methods by a large margin on several datasets and code sizes. For example, QINCo achieves better nearest-neighbor search accuracy using 12 bytes codes than other methods using 16 bytes on the BigANN and Deep1B dataset.