An Online Hierarchical Algorithm for Extreme Clustering

Kobren, Ari, Monath, Nicholas, Krishnamurthy, Akshay, McCallum, Andrew

Apr-6-2017–arXiv.org Machine Learning

Clustering algorithms are a crucial component of any data scientist's toolbox with applications ranging from identifying themes in large text corpora [10], to finding functionally similar genes [17], to visualization, pre-processing, and dimensionality reduction [21]. As such, a number of clustering algorithms have been developed and studied by the statistics, machine learning, and theoretical computer science communities. These algorithms and analyses target a variety of scenarios, including large-scale, online, or streaming settings [38, 1], clustering with distribution shift [2], and many more. Modern clustering applications require algorithms that scale gracefully with dataset size and complexity. In clustering, data set size is measured by the number of points N and their dimensionality d, while the number of clusters, K, serves as a measure of complexity. While several existing algorithms can cope with large datasets, very few adequately handle datasets with many clusters. We call problem instances with large N and large K extreme clustering problems-a phrase inspired by work in extreme classification [12]. Extreme clustering problems are increasingly prevalent. For example, in entity resolution, record linkage and deduplication, the number of clusters (i.e., entities) increases with dataset size [6] and can be in the The first two authors contributed equally.

algorithm, artificial intelligence, machine learning, (14 more...)

arXiv.org Machine Learning

Apr-6-2017

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.68)

Genre:
- Research Report (0.50)

Technology:
- Information Technology
  - Data Science (1.00)
  - Artificial Intelligence > Machine Learning
    - Statistical Learning > Clustering (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found