Billion-scale Similarity Search Using a Hybrid Indexing Approach with Advanced Filtering

Emanuilov, Simeon, Dimov, Aleksandar

arXiv.org Artificial Intelligence 

Similarity search, the task of finding similar vectors, has become a fundamental operation in machine learning, with applications in recommendation engines, semantic search systems, and more [1-3]. As datasets grow to billions of entries, the challenge of performing efficient searches on high-dimensional vectors becomes increasingly complex [4]. This is further compounded by the well-known curse of dimensionality [5], which affects the performance and accuracy of search algorithms as the number of dimensions increases. Approximate Nearest Neighbor (ANN) algorithms, such as Inverted File Index (IVF) [6] and Hierarchical Navigable Small World (HNSW) [7], have been developed to address scalability and performance issues. IVF segments the search space into smaller areas, called Voronoi cells [8], while HNSW constructs a navigable graph structure for efficient search space traversal. Despite their advancements, these methods often struggle to support complex, multi-dimensional filtering efficiently. This is crucial in practical scenarios where additional criteria beyond vector similarity are required to refine search results [6]. Examples of such scenarios include e-commerce product search and semantic search with filtering and recommendation systems.