DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node

Neural Information Processing Systems

Current state-of-the-art approximate nearest neighbor search (ANNS) algorithms generate indices that must be stored in main memory for fast high-recall search. This makes them expensive and limits the size of the dataset. We present a new graph-based indexing and search system called DiskANN that can index, store, and search a billion-point database on a single workstation with just 64GB RAM and an inexpensive solid-state drive (SSD). Contrary to current wisdom, we demonstrate that the SSD-based indices built by DiskANN can meet all three desiderata for large-scale ANNS: high recall, low query latency, and high density (points indexed per node). On the billion-point SIFT1B bigann dataset, DiskANN serves > 5000 queries a second with < 3ms mean latency and 95%+ 1-recall@1 on a 16-core machine, where state-of-the-art billion-point ANNS algorithms with a similar memory footprint, like FAISS and IVFOADC+G+P, plateau at around 50% 1-recall@1. Alternatively, in the high-recall regime, DiskANN can index and serve 5-10x more points per node compared to state-of-the-art graph-based methods such as HNSW and NSG. Finally, as part of our overall DiskANN system, we introduce Vamana, a new graph-based ANNS index that is more versatile than the existing graph indices even for in-memory indices.
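The abstract reports accuracy as 1-recall@1. As a minimal sketch of how such a figure is usually computed, the snippet below averages, over all queries, the fraction of the true k nearest neighbors that appear among the k returned results. The function name `k_recall_at_k` and the toy ID lists are illustrative assumptions, not taken from the paper or from the DiskANN code.

```python
from typing import Sequence


def k_recall_at_k(
    retrieved: Sequence[Sequence[int]],
    ground_truth: Sequence[Sequence[int]],
    k: int,
) -> float:
    """Mean over queries of |top-k retrieved ∩ true top-k| / k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(retrieved) != len(ground_truth) or not retrieved:
        raise ValueError("need one non-empty result list per ground-truth list")
    total = 0.0
    for got, truth in zip(retrieved, ground_truth):
        # Only the first k entries of each list count toward recall.
        total += len(set(got[:k]) & set(truth[:k])) / k
    return total / len(retrieved)


# Hypothetical point IDs for four queries: three return the true nearest neighbor.
results = [[3], [7], [1], [4]]
truth = [[3], [7], [2], [4]]
print(k_recall_at_k(results, truth, k=1))  # 0.75
```

With k = 1 this reduces to the fraction of queries whose top result is the exact nearest neighbor, which is the quantity the 95%+ and 50% figures refer to.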


Reviews: DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node

Neural Information Processing Systems

The writing could be improved, but it is generally understandable. However, citation quality can be improved. In particular, it seems to me that NSG and HNSW are actually using the same pruning rule (which results in an approximate relative neighborhood graph). I really like your updated version, which reduces the number of hops (and I haven't seen this pruning variant before)! Detailed comments: Abstract and further: "base points" sounds like a strange term; do you mean domain points? Please find a more specific-generic citation that describes this phenomenon.
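The pruning rule the reviewer refers to can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: candidates are scanned in order of distance to the point `p`, each kept neighbor removes every remaining candidate that is at least as close to that neighbor as to `p`, and a slack factor `alpha` (an assumed parameter name) relaxes the test. With `alpha = 1.0` the rule behaves like relative-neighborhood-style pruning; with `alpha > 1.0` fewer candidates are removed, so longer edges survive, which is one way a graph can need fewer hops per search. The function name `prune_neighbors` and the toy coordinates are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Sequence


def prune_neighbors(
    p: int,
    candidates: Iterable[int],
    points: Dict[int, Sequence[float]],
    max_degree: int,
    alpha: float = 1.0,
) -> List[int]:
    """Greedy distance-based pruning of the candidate out-neighbors of p."""

    def dist(a: int, b: int) -> float:
        return math.dist(points[a], points[b])

    # Closest candidates first; ties broken by ID for determinism.
    remaining = sorted(
        (c for c in set(candidates) if c != p),
        key=lambda c: (dist(p, c), c),
    )
    kept: List[int] = []
    while remaining and len(kept) < max_degree:
        best = remaining.pop(0)
        kept.append(best)
        # Drop c when it is close enough to `best` to be reachable through it.
        remaining = [c for c in remaining if alpha * dist(best, c) > dist(p, c)]
    return kept


# Toy 2-D points; the IDs and coordinates are made up for illustration.
pts = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.5, 0.0), 3: (0.0, 2.0)}
print(prune_neighbors(0, [1, 2, 3], pts, max_degree=3, alpha=1.0))  # [1, 3]
print(prune_neighbors(0, [1, 2, 3], pts, max_degree=3, alpha=2.0))  # [1, 3, 2]
```

In the toy example, point 2 lies behind point 1 as seen from point 0, so the strict rule discards it, while the relaxed rule keeps the longer edge.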


Reviews: DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node

Neural Information Processing Systems

In post-rebuttal discussions, the reviewers concurred that the paper presents a solid state-of-the-art implementation and very impressive results, which will have a good impact for practitioners. This significant impact by itself was deemed worthy of publication.


Rand-NSG: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node

Neural Information Processing Systems

Current state-of-the-art approximate nearest neighbor search (ANNS) algorithms generate indices that must be stored in main memory for fast high-recall search. This makes them expensive and limits the size of the dataset. We present a new graph-based indexing and search system called DiskANN that can index, store, and search a billion-point database on a single workstation with just 64GB RAM and an inexpensive solid-state drive (SSD). Contrary to current wisdom, we demonstrate that the SSD-based indices built by DiskANN can meet all three desiderata for large-scale ANNS: high recall, low query latency, and high density (points indexed per node). On the billion-point SIFT1B bigann dataset, DiskANN serves > 5000 queries a second with < 3ms mean latency and 95%+ 1-recall@1 on a 16-core machine, where state-of-the-art billion-point ANNS algorithms with a similar memory footprint, like FAISS and IVFOADC+G+P, plateau at around 50% 1-recall@1.