An Online Hierarchical Algorithm for Extreme Clustering
Kobren, Ari, Monath, Nicholas, Krishnamurthy, Akshay, McCallum, Andrew
Clustering algorithms are a crucial component of any data scientist's toolbox with applications ranging from identifying themes in large text corpora [10], to finding functionally similar genes [17], to visualization, pre-processing, and dimensionality reduction [21]. As such, a number of clustering algorithms have been developed and studied by the statistics, machine learning, and theoretical computer science communities. These algorithms and analyses target a variety of scenarios, including large-scale, online, or streaming settings [38, 1], clustering with distribution shift [2], and many more. Modern clustering applications require algorithms that scale gracefully with dataset size and complexity. In clustering, data set size is measured by the number of points N and their dimensionality d, while the number of clusters, K, serves as a measure of complexity. While several existing algorithms can cope with large datasets, very few adequately handle datasets with many clusters. We call problem instances with large N and large K extreme clustering problems-a phrase inspired by work in extreme classification [12]. Extreme clustering problems are increasingly prevalent. For example, in entity resolution, record linkage and deduplication, the number of clusters (i.e., entities) increases with dataset size [6] and can be in the The first two authors contributed equally.
Apr-6-2017