Dhulipala, Laxman
The ParClusterers Benchmark Suite (PCBS): A Fine-Grained Analysis of Scalable Graph Clustering
Yu, Shangdi, Shi, Jessica, Meindl, Jamison, Eisenstat, David, Ju, Xiaoen, Tavakkol, Sasan, Dhulipala, Laxman, Łącki, Jakub, Mirrokni, Vahab, Shun, Julian
We introduce the ParClusterers Benchmark Suite (PCBS) -- a collection of highly scalable parallel graph clustering algorithms and benchmarking tools that streamline comparing different graph clustering algorithms and implementations. The benchmark includes clustering algorithms that target a wide range of modern clustering use cases, including community detection, classification, and dense subgraph mining. The benchmark toolkit makes it easy to run and evaluate multiple instances of different clustering algorithms, which can be useful for fine-tuning the performance of clustering on a given task, and for comparing different clustering algorithms based on different metrics of interest, including clustering quality and running time. Using PCBS, we evaluate a broad collection of real-world graph clustering datasets. Somewhat surprisingly, we find that the best quality results are obtained by algorithms that are not included in many popular graph clustering toolkits. The PCBS provides a standardized way to evaluate and judge the quality-performance tradeoffs of the active research area of scalable graph clustering algorithms. We believe it will help enable fair, accurate, and nuanced evaluation of graph clustering algorithms in the future.
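To make concrete the kind of comparison the toolkit streamlines, below is a minimal Python sketch -- not the PCBS API -- that uses NetworkX community-detection routines as stand-ins, times two clustering algorithms on a small placeholder graph, and scores each with modularity as one possible quality metric.

    # Minimal sketch of a quality-vs-runtime comparison; NetworkX algorithms and the
    # karate club graph are stand-ins, not part of PCBS itself.
    import time
    import networkx as nx
    from networkx.algorithms import community

    G = nx.karate_club_graph()  # placeholder dataset; PCBS targets billion-edge graphs

    algorithms = {
        "louvain": lambda g: community.louvain_communities(g, seed=1),
        "label_propagation": lambda g: list(community.label_propagation_communities(g)),
    }

    for name, run in algorithms.items():
        start = time.perf_counter()
        clusters = run(G)
        elapsed = time.perf_counter() - start
        quality = community.modularity(G, clusters)  # one of many possible quality metrics
        print(f"{name}: {len(clusters)} clusters, modularity={quality:.3f}, time={elapsed:.4f}s")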
Results of the Big ANN: NeurIPS'23 competition
Simhadri, Harsha Vardhan, Aumüller, Martin, Ingber, Amir, Douze, Matthijs, Williams, George, Manohar, Magdalen Dobson, Baranchuk, Dmitry, Liberty, Edo, Liu, Frank, Landrum, Ben, Karjikar, Mazin, Dhulipala, Laxman, Chen, Meng, Chen, Yue, Ma, Rui, Zhang, Kai, Cai, Yuzheng, Shi, Jiayang, Chen, Yizhuo, Zheng, Weiguo, Wan, Zihao, Yin, Jie, Huang, Ben
The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state-of-the-art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search (Simhadri et al., NeurIPS 2021), this competition addressed filtered search, out-of-distribution data, and sparse and streaming variants of ANNS. Participants developed and submitted innovative solutions that were evaluated on new standard datasets with constrained computational resources. The results showcased significant improvements in search accuracy and efficiency over industry-standard baselines, with notable contributions from both academic and industrial teams. This paper summarizes the competition tracks, datasets, evaluation metrics, and the innovative approaches of the top-performing submissions, providing insights into the current advancements and future directions in the field of approximate nearest neighbor search.
Approximate Nearest Neighbor Search with Window Filters
Engels, Joshua, Landrum, Benjamin, Yu, Shangdi, Dhulipala, Laxman, Shun, Julian
The nearest neighbor search problem has been widely studied for more than 30 years (Arya & Mount, 1993). Given a dataset D, the problem requires the construction of an index that can efficiently answer queries of the form "what is the closest vector to x in D?" Solving this problem exactly degrades to a brute force linear search in high dimensions (Rubinstein, 2018), so instead both theoreticians and practitioners focus on the relaxed c-approximate nearest neighbor search problem (ANNS), which asks "what is a vector in D whose distance to x is within a factor of c of the distance from x to its exact nearest neighbor?"
Although this problem has many motivating examples, there is a dearth of papers examining it in the literature. Some vector databases analyze window search-like problem instances as an additional feature of their system, but this analysis is typically secondary to their main approach and too slow for large-scale real-world systems; as far as we are aware, we are the first to propose, analyze, and experiment with a non-trivial solution to the window search problem.
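As a point of reference for the problem just described, here is a minimal sketch -- not from the paper -- of the exact brute-force baseline for window search: nearest neighbor search restricted to points whose numeric label falls in a query-supplied range. The dataset size, dimensionality, and label distribution are illustrative placeholders.

    # Exact brute-force window search: linear scan over the label-filtered subset.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.standard_normal((10_000, 64))   # dataset D of vectors
    labels = rng.uniform(0, 1, size=10_000)    # one numeric label per vector (e.g., a timestamp)

    def window_nn(query, lo, hi):
        """Return the index of the nearest vector whose label lies in [lo, hi]."""
        mask = (labels >= lo) & (labels <= hi)
        if not mask.any():
            return None
        candidates = np.flatnonzero(mask)
        dists = np.linalg.norm(data[candidates] - query, axis=1)
        return int(candidates[np.argmin(dists)])

    print(window_nn(rng.standard_normal(64), 0.25, 0.75))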
Scaling Graph-Based ANNS Algorithms to Billion-Size Datasets: A Comparative Analysis
Dobson, Magdalen, Shen, Zheqi, Blelloch, Guy E., Dhulipala, Laxman, Gu, Yan, Simhadri, Harsha Vardhan, Sun, Yihan
Algorithms for approximate nearest-neighbor search (ANNS) have been the topic of significant recent interest in the research community. However, evaluations of such algorithms are usually restricted to a small number of datasets with millions or tens of millions of points, whereas real-world applications require algorithms that work on the scale of billions of points. Furthermore, existing evaluations of ANNS algorithms are typically heavily focused on measuring and optimizing for queries per second (QPS) at a given accuracy, which can be hardware-dependent and ignores important metrics such as build time. In this paper, we propose a set of principled measures for evaluating ANNS algorithms which refocuses on their scalability to billion-size datasets. These measures include the ability to be efficiently parallelized, build times, and scaling relationships as dataset size increases. We also expand on the QPS measure with machine-agnostic measures such as the number of distance computations per query, and we evaluate ANNS data structures on their accuracy in more demanding settings required in modern applications, such as evaluating range queries and running on out-of-distribution data. We optimize four graph-based algorithms for the billion-scale setting, and in the process provide a general framework for making many incremental ANNS graph algorithms lock-free. We use our framework to evaluate the aforementioned graph-based ANNS algorithms as well as two alternative approaches.
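To illustrate the machine-agnostic measure of distance computations per query mentioned above, here is a minimal sketch -- not one of the benchmarked systems -- of greedy best-first search over a proximity graph, instrumented to count every distance evaluation; the random toy graph stands in for a real index such as a k-NN, Vamana, or HNSW graph.

    import numpy as np

    def greedy_search(graph, data, query, start):
        """Hill-climbing search on graph (node id -> neighbor ids); counts distance evaluations."""
        dist_count = 0

        def dist(i):
            nonlocal dist_count
            dist_count += 1
            return float(np.linalg.norm(data[i] - query))

        visited = {start}
        current, current_dist = start, dist(start)
        improved = True
        while improved:
            improved = False
            for nbr in graph[current]:
                if nbr in visited:
                    continue
                visited.add(nbr)
                d = dist(nbr)
                if d < current_dist:
                    current, current_dist, improved = nbr, d, True
        return current, dist_count  # approximate neighbor and the work it cost

    rng = np.random.default_rng(0)
    pts = rng.standard_normal((1000, 32))
    # Toy graph: 8 random out-neighbors per node; a real index would be built far more carefully.
    graph = {i: [int(j) for j in rng.choice(1000, size=8, replace=False)] for i in range(1000)}
    print(greedy_search(graph, pts, rng.standard_normal(32), start=0))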
Scalable Community Detection via Parallel Correlation Clustering
Shi, Jessica, Dhulipala, Laxman, Eisenstat, David, Łącki, Jakub, Mirrokni, Vahab
Graph clustering and community detection are central problems in modern data mining. The increasing need for analyzing billion-scale data calls for faster and more scalable algorithms for these problems. There are certain trade-offs between the quality and speed of such clustering algorithms. In this paper, we design scalable algorithms that achieve high quality when evaluated based on ground truth. We develop a generalized sequential and shared-memory parallel framework based on the LambdaCC objective (introduced by Veldt et al.), which encompasses modularity and correlation clustering. Our framework consists of highly-optimized implementations that scale to large data sets of billions of edges and that obtain high-quality clusters compared to ground-truth data, on both unweighted and weighted graphs. Our empirical evaluation shows that this framework improves the state-of-the-art trade-offs between speed and quality of scalable community detection. For example, on a 30-core machine with two-way hyper-threading, our implementations achieve orders of magnitude speedups over other correlation clustering baselines, and up to 28.44x speedups over our own sequential baselines while maintaining or improving quality.
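For reference, one common statement of the LambdaCC objective that the framework targets (following Veldt et al.; the notation here is ours) is the following disagreement-minimization problem over clusterings $\mathcal{C}$ of the vertex set:

$\min_{\mathcal{C}} \; \sum_{(u,v) \in E} (1-\lambda)\,\mathbf{1}[\mathcal{C}(u) \neq \mathcal{C}(v)] \;+\; \sum_{(u,v) \notin E} \lambda\,\mathbf{1}[\mathcal{C}(u) = \mathcal{C}(v)],$

where $\lambda \in (0,1)$ is a resolution parameter: cutting an edge costs $1-\lambda$, placing a non-adjacent pair in the same cluster costs $\lambda$, and larger $\lambda$ therefore yields finer clusters. As the abstract notes, this family encompasses both correlation clustering and (via degree-weighted penalties) modularity.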
Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time
Dhulipala, Laxman, Eisenstat, David, Łącki, Jakub, Mirrokni, Vahab, Shi, Jessica
We study the widely used hierarchical agglomerative clustering (HAC) algorithm on edge-weighted graphs. We define an algorithmic framework for hierarchical agglomerative graph clustering that provides the first efficient $\tilde{O}(m)$ time exact algorithms for classic linkage measures, such as complete- and WPGMA-linkage, as well as other measures. Furthermore, for average-linkage, arguably the most popular variant of HAC, we provide an algorithm that runs in $\tilde{O}(n\sqrt{m})$ time. For this variant, this is the first exact algorithm that runs in subquadratic time, as long as $m=n^{2-\epsilon}$ for some constant $\epsilon > 0$. We complement this result with a simple $\epsilon$-close approximation algorithm for average-linkage in our framework that runs in $\tilde{O}(m)$ time. As an application of our algorithms, we consider clustering points in a metric space by first using $k$-NN to generate a graph from the point set, and then running our algorithms on the resulting weighted graph. We validate the performance of our algorithms on publicly available datasets, and show that our approach can speed up clustering of point datasets by a factor of 20.7--76.5x.
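As an illustration of the point-clustering pipeline described above (build a k-NN graph over the points, then run average-linkage HAC on it), here is a minimal scikit-learn sketch; it relies on off-the-shelf components rather than the paper's nearly-linear-time graph HAC algorithms, and the dataset and parameters are placeholders.

    # Point clustering via a k-NN graph plus average-linkage HAC, using standard scikit-learn pieces.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import kneighbors_graph
    from sklearn.cluster import AgglomerativeClustering

    X, _ = make_blobs(n_samples=5000, centers=10, random_state=0)

    # Sparse k-NN connectivity graph over the points (k is a tunable parameter).
    knn = kneighbors_graph(X, n_neighbors=10, include_self=False)

    # Average-linkage HAC restricted to the k-NN graph's connectivity structure.
    hac = AgglomerativeClustering(n_clusters=10, linkage="average", connectivity=knn)
    clusters = hac.fit_predict(X)
    print(np.bincount(clusters))  # cluster sizes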