AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Semi-crowdsourced Clustering with Deep Generative Models

Luo, Yucen, TIAN, TIAN, Shi, Jiaxin, Zhu, Jun, Zhang, Bo

Neural Information Processing SystemsFeb-14-2020, 12:12:03 GMT

We consider the semi-supervised clustering problem where crowdsourcing provides noisy information about the pairwise comparisons on a small subset of data, i.e., whether a sample pair is in the same cluster. We propose a new approach that includes a deep generative model (DGM) to characterize low-level features of the data, and a statistical relational model for noisy pairwise annotations on its subset. The two parts share the latent variables. To make the model automatically trade-off between its complexity and fitting data, we also develop its fully Bayesian variant. The challenge of inference is addressed by fast (natural-gradient) stochastic variational inference algorithms, where we effectively combine variational message passing for the relational part and amortized learning of the DGM under a unified framework.

deep generative model, semi-crowdsourced clustering, subset

Neural Information Processing Systems

Technology:

Information Technology > Communications > Social Media > Crowdsourcing (0.82)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search

Moseley, Benjamin, Wang, Joshua

Neural Information Processing SystemsFeb-14-2020, 11:57:56 GMT

Hierarchical clustering is a data analysis method that has been used for decades. Despite its widespread use, the method has an underdeveloped analytical foundation. Having a well understood foundation would both support the currently used methods and help guide future improvements. The goal of this paper is to give an analytic framework to better understand observations seen in practice. This paper considers the dual of a problem framework for hierarchical clustering introduced by Dasgupta. The main result is that one of the most popular algorithms used in practice, average linkage agglomerative clustering, has a small constant approximation ratio for this objective.

algorithm, approximation bound, hierarchical clustering, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Graph Clustering: Block-models and model free results

Wan, Yali, Meila, Marina

Neural Information Processing SystemsFeb-14-2020, 11:42:04 GMT

Clustering graphs under the Stochastic Block Model (SBM) and extensions are well studied. Guarantees of correctness exist under the assumption that the data is sampled from a model. In this paper, we propose a framework, in which we obtain "correctness" guarantees without assuming the data comes from a model. The guarantees we obtain depend instead on the statistics of the data that can be checked. We also show that this framework ties in with the existing model-based framework, and that we can exploit results in model-based recovery, as well as strengthen the results existing in that area of research.

block-model and model free result, correctness, graph clustering

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.40)

Add feedback

Graph Clustering With Missing Data: Convex Algorithms and Analysis

Vinayak, Ramya Korlakai, Oymak, Samet, Hassibi, Babak

Neural Information Processing SystemsFeb-14-2020, 11:26:58 GMT

We consider the problem of finding clusters in an unweighted graph, when the graph is partially observed. We analyze two programs, one which works for dense graphs and one which works for both sparse and dense graphs, but requires some a priori knowledge of the total cluster size, that are based on the convex optimization approach for low-rank matrix recovery using nuclear norm minimization. For the commonly used Stochastic Block Model, we obtain \emph{explicit} bounds on the parameters of the problem (size and sparsity of clusters, the amount of observed data) and the regularization parameter characterize the success and failure of the programs. We corroborate our theoretical findings through extensive simulations. We also run our algorithm on a real data set obtained from crowdsourcing an image classification task on the Amazon Mechanical Turk, and observe significant performance improvement over traditional methods such as k-means.

convex algorithm and analysis, graph clustering, missing data

Neural Information Processing Systems

Technology:

Information Technology > Communications > Social Media > Crowdsourcing (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.40)

Add feedback

Hierarchical Clustering via Spreading Metrics

Roy, Aurko, Pokutta, Sebastian

Neural Information Processing SystemsFeb-14-2020, 11:12:27 GMT

We study the cost function for hierarchical clusterings introduced by [Dasgupta, 2015] where hierarchies are treated as first-class objects rather than deriving their cost from projections into flat clusters. It was also shown in [Dasgupta, 2015] that a top-down algorithm returns a hierarchical clustering of cost at most \(O\left(\alpha_n \log n\right)\) times the cost of the optimal hierarchical clustering, where \(\alpha_n\) is the approximation ratio of the Sparsest Cut subroutine used. Thus using the best known approximation algorithm for Sparsest Cut due to Arora-Rao-Vazirani, the top down algorithm returns a hierarchical clustering of cost at most \(O\left(\log {3/2} n\right)\) times the cost of the optimal solution. We improve this by giving an \(O(\log{n})\)-approximation algorithm for this problem. Our main technical ingredients are a combinatorial characterization of ultrametrics induced by this cost function, deriving an Integer Linear Programming (ILP) formulation for this family of ultrametrics, and showing how to iteratively round an LP relaxation of this formulation by using the idea of \emph{sphere growing} which has been extensively used in the context of graph partitioning.

algorithm return, hierarchical clustering, spreading metric, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Graphons, mergeons, and so on!

Eldridge, Justin, Belkin, Mikhail, Wang, Yusu

Neural Information Processing SystemsFeb-14-2020, 11:12:16 GMT

In this work we develop a theory of hierarchical clustering for graphs. Our modelling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We define what it means for an algorithm to produce the correct" clustering, give sufficient conditions in which a method is statistically consistent, and provide an explicit algorithm satisfying these properties. Papers published at the Neural Information Processing Systems Conference.

algorithm, graphon, mergeon

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.71)

Add feedback

Adaptive Clustering through Semidefinite Programming

Royer, Martin

Neural Information Processing SystemsFeb-14-2020, 08:57:52 GMT

We analyze the clustering problem through a flexible probabilistic model that aims to identify an optimal partition on the sample X1,...,Xn. We perform exact clustering with high probability using a convex semidefinite estimator that interprets as a corrected, relaxed version of K-means. The estimator is analyzed through a non-asymptotic framework and showed to be optimal or near-optimal in recovering the partition. Furthermore, its performances are shown to be adaptive to the problem's effective dimension, as well as to K the unknown number of groups in this partition. We illustrate the method's performances in comparison to other classical clustering algorithms with numerical experiments on simulated high-dimensional data.

adaptive clustering, partition, semidefinite programming, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Fast Distributed k-Center Clustering with Outliers on Massive Data

Malkomes, Gustavo, Kusner, Matt J., Chen, Wenlin, Weinberger, Kilian Q., Moseley, Benjamin

Neural Information Processing SystemsFeb-14-2020, 08:26:09 GMT

Clustering large data is a fundamental problem with a vast number of applications. Due to the increasing size of data, practitioners interested in clustering have turned to distributed computation methods. In this work, we consider the widely used k-center clustering problem and its variant used to handle noisy data, k-center with outliers. In the noise-free setting we demonstrate how a previously-proposed distributed method is actually an O(1)-approximation algorithm, which accurately explains its strong empirical performance. Additionally, in the noisy setting, we develop a novel distributed algorithm that is also an O(1)-approximation.

algorithm, k-center clustering, massive data, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.45)

Add feedback

A Dirichlet Mixture Model of Hawkes Processes for Event Sequence Clustering

Xu, Hongteng, Zha, Hongyuan

Neural Information Processing SystemsFeb-14-2020, 07:44:03 GMT

How to cluster event sequences generated via different point processes is an interesting and important problem in statistical machine learning. To solve this problem, we propose and discuss an effective model-based clustering method based on a novel Dirichlet mixture model of a special but significant type of point processes --- Hawkes process. The proposed model generates the event sequences with different clusters from the Hawkes processes with different parameters, and uses a Dirichlet process as the prior distribution of the clusters. We prove the identifiability of our mixture model and propose an effective variational Bayesian inference algorithm to learn our model. An adaptive inner iteration allocation strategy is designed to accelerate the convergence of our algorithm.

dirichlet mixture model, event sequence clustering, hawke process, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

Localized Data Fusion for Kernel k-Means Clustering with Application to Cancer Biology

Gönen, Mehmet, Margolin, Adam A.

Neural Information Processing SystemsFeb-14-2020, 07:27:11 GMT

In many modern applications from, for example, bioinformatics and computer vision, samples have multiple feature representations coming from different data sources. Multiview learning algorithms try to exploit all these available information to obtain a better learner in such scenarios. In this paper, we propose a novel multiple kernel learning algorithm that extends kernel k-means clustering to the multiview setting, which combines kernels calculated on the views in a localized way to better capture sample-specific characteristics of the data. We demonstrate the better performance of our localized data fusion approach on a human colon and rectal cancer data set by clustering patients. Our method finds more relevant prognostic patient groups than global data fusion methods when we evaluate the results with respect to three commonly used clinical biomarkers.

cancer biology, kernel k-means clustering, localized data fusion, (1 more...)

Neural Information Processing Systems

Industry: Health & Medicine > Therapeutic Area > Oncology > Colorectal Cancer (0.31)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Add feedback