AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Planar Ultrametrics for Image Segmentation

Yarkony, Julian E., Fowlkes, Charless

Neural Information Processing SystemsDec-31-2015

We study the problem of hierarchical clustering on planar graphs. We formulate this in terms of finding the closest ultrametric to a specified set of distances and solve it using an LP relaxation that leverages minimum cost perfect matching as a subroutine to efficiently explore the space of planar partitions. We apply our algorithm to the problem of hierarchical image segmentation.

artificial intelligence, constraint, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.28)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.36)

Add feedback

k-Means Clustering Is Matrix Factorization

Bauckhage, Christian

arXiv.org Machine LearningDec-23-2015

We show that the objective function of conventional k-means clustering can be expressed as the Frobenius norm of the difference of a data matrix and a low rank approximation of that data matrix. In short, we show that k-means clustering is a matrix factorization problem. These notes are meant as a reference and intended to provide a guided tour towards a result that is often mentioned but seldom made explicit in the literature.

artificial intelligence, clustering, machine learning, (14 more...)

arXiv.org Machine Learning

1512.07548

Country: Europe (0.16)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.96)

Add feedback

A Comprehensive Approach to Mode Clustering

Chen, Yen-Chi, Genovese, Christopher R., Wasserman, Larry

arXiv.org Machine LearningDec-22-2015

Mode clustering is a nonparametric method for clustering that defines clusters using the basins of attraction of a density estimator's modes. We provide several enhancements to mode clustering: (i) a soft variant of cluster assignment, (ii) a measure of connectivity between clusters, (iii) a technique for choosing the bandwidth, (iv) a method for denoising small clusters, and (v) an approach to visualizing the clusters. Combining all these enhancements gives us a complete procedure for clustering in multivariate problems. We also compare mode clustering to other clustering methods in several examples

artificial intelligence, machine learning, survey article, (19 more...)

arXiv.org Machine Learning

1406.178

Country:

Europe > Italy (1.00)
North America > United States (0.93)

Genre:

Research Report (0.64)
Overview (0.46)

Industry: Materials > Chemicals (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Multilinear Subspace Clustering

Kernfeld, Eric, Majumder, Nathan, Aeron, Shuchin, Kilmer, Misha

arXiv.org Machine LearningDec-21-2015

ABSTRACT In this paper we present a new model and an algorithm for unsupervised clustering of 2-D data such as images. We assume that the data comes from a union of multilinear subspaces (UOMS) model, which is a specific structured case of the much studied union of subspaces (UOS) model. For segmentation under this model, we develop Multilinear Subspace Clustering (MSC) algorithm and evaluate its performance on the YaleB and Olivietti image data sets. We show that MSC is highly competitive with existing algorithms employing the UOS model in terms of clustering performance while enjoying improvement in computational complexity. Index Terms - subspace clustering, multilinear algebra, spectral clustering 1. INTRODUCTION Most clustering algorithms seek to detect disjoint clouds of data.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1512.0673

Country: North America > United States (0.94)

Genre: Research Report (0.50)

Industry: Media (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Clustering and Inference From Pairwise Comparisons

Wu, Rui, Xu, Jiaming, Srikant, R., Massoulié, Laurent, Lelarge, Marc, Hajek, Bruce

arXiv.org Machine LearningDec-17-2015

Given a set of pairwise comparisons, the classical ranking problem computes a single ranking that best represents the preferences of all users. In this paper, we study the problem of inferring individual preferences, arising in the context of making personalized recommendations. In particular, we assume that there are $n$ users of $r$ types; users of the same type provide similar pairwise comparisons for $m$ items according to the Bradley-Terry model. We propose an efficient algorithm that accurately estimates the individual preferences for almost all users, if there are $r \max \{m, n\}\log m \log^2 n$ pairwise comparisons per type, which is near optimal in sample complexity when $r$ only grows logarithmically with $m$ or $n$. Our algorithm has three steps: first, for each user, compute the \emph{net-win} vector which is a projection of its $\binom{m}{2}$-dimensional vector of pairwise comparisons onto an $m$-dimensional linear subspace; second, cluster the users based on the net-win vectors; third, estimate a single preference for each cluster separately. The net-win vectors are much less noisy than the high dimensional vectors of pairwise comparisons and clustering is more accurate after the projection as confirmed by numerical experiments. Moreover, we show that, when a cluster is only approximately correct, the maximum likelihood estimation for the Bradley-Terry model is still close to the true preference.

artificial intelligence, machine learning, vector, (20 more...)

arXiv.org Machine Learning

1502.04631

Country: North America > United States (0.46)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Dimensionality-reduced subspace clustering

Heckel, Reinhard, Tschannen, Michael, Bölcskei, Helmut

arXiv.org Machine LearningDec-13-2015

Subspace clustering refers to the problem of clustering unlabeled high-dimensional data points into a union of low-dimensional linear subspaces, whose number, orientations, and dimensions are all unknown. In practice one may have access to dimensionality-reduced observations of the data only, resulting, e.g., from undersampling due to complexity and speed constraints on the acquisition device or mechanism. More pertinently, even if the high-dimensional data set is available it is often desirable to first project the data points into a lower-dimensional space and to perform clustering there; this reduces storage requirements and computational cost. The purpose of this paper is to quantify the impact of dimensionality reduction through random projection on the performance of three subspace clustering algorithms, all of which are based on principles from sparse signal recovery. Specifically, we analyze the thresholding based subspace clustering (TSC) algorithm, the sparse subspace clustering (SSC) algorithm, and an orthogonal matching pursuit variant thereof (SSC-OMP). We find, for all three algorithms, that dimensionality reduction down to the order of the subspace dimensions is possible without incurring significant performance degradation. Moreover, these results are order-wise optimal in the sense that reducing the dimensionality further leads to a fundamentally ill-posed clustering problem. Our findings carry over to the noisy case as illustrated through analytical results for TSC and simulations for SSC and SSC-OMP. Extensive experiments on synthetic and real data complement our theoretical findings.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Machine Learning

1507.07105

Country: Europe (0.28)

Genre: Research Report > New Finding (0.87)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.54)

Add feedback

A Population Background for Nonparametric Density-Based Clustering

Chacón, José E.

arXiv.org Machine LearningDec-10-2015

Despite its popularity, it is widely recognized that the investigation of some theoretical aspects of clustering has been relatively sparse. One of the main reasons for this lack of theoretical results is surely the fact that, whereas for other statistical problems the theoretical population goal is clearly defined (as in regression or classification), for some of the clustering methodologies it is difficult to specify the population goal to which the data-based clustering algorithms should try to get close. This paper aims to provide some insight into the theoretical foundations of clustering by focusing on two main objectives: to provide an explicit formulation for the ideal population goal of the modal clustering methodology, which understands clusters as regions of high density; and to present two new loss functions, applicable in fact to any clustering methodology, to evaluate the performance of a data-based clustering algorithm with respect to the ideal population goal. In particular, it is shown that only mild conditions on a sequence of density estimators are needed to ensure that the sequence of modal clusterings that they induce is consistent.

artificial intelligence, machine learning, partition, (18 more...)

arXiv.org Machine Learning

doi: 10.1214/15-STS526

1408.1381

Country:

North America > United States (1.00)
Europe (0.67)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Convex Analysis of Mixtures for Separating Non-negative Well-grounded Sources

Zhu, Yitan, Wang, Niya, Miller, David J., Wang, Yue

arXiv.org Machine LearningDec-10-2015

Blind Source Separation (BSS) has proven to be a powerful tool for the analysis of composite patterns in engineering and science. We introduce Convex Analysis of Mixtures (CAM) for separating non-negative well-grounded sources, which learns the mixing matrix by identifying the lateral edges of the convex data scatter plot. We prove a sufficient and necessary condition for identifying the mixing matrix through edge detection, which also serves as the foundation for CAM to be applied not only to the exact-determined and over-determined cases, but also to the under-determined case. We show the optimality of the edge detection strategy, even for cases where source well-groundedness is not strictly satisfied. The CAM algorithm integrates plug-in noise filtering using sector-based clustering, an efficient geometric convex analysis scheme, and stability-based model order selection. We demonstrate the principle of CAM on simulated data and numerically mixed natural images. The superior performance of CAM against a panel of benchmark BSS techniques is demonstrated on numerically mixed gene expression data. We then apply CAM to dissect dynamic contrast-enhanced magnetic resonance imaging data taken from breast tumors and time-course microarray gene expression data derived from in-vivo muscle regeneration in mice, both producing biologically plausible decomposition results.

artificial intelligence, bioinformatics, machine learning, (17 more...)

arXiv.org Machine Learning

1406.7349

Country: North America > United States (1.00)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Data Science (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.68)
Information Technology > Biomedical Informatics > Translational Bioinformatics (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Adjusting for Chance Clustering Comparison Measures

Romano, Simone, Vinh, Nguyen Xuan, Bailey, James, Verspoor, Karin

arXiv.org Machine LearningDec-3-2015

Adjusted for chance measures are widely used to compare partitions/clusterings of the same data set. In particular, the Adjusted Rand Index (ARI) based on pair-counting, and the Adjusted Mutual Information (AMI) based on Shannon information theory are very popular in the clustering community. Nonetheless it is an open problem as to what are the best application scenarios for each measure and guidelines in the literature for their usage are sparse, with the result that users often resort to using both. Generalized Information Theoretic (IT) measures based on the Tsallis entropy have been shown to link pair-counting and Shannon IT measures. In this paper, we aim to bridge the gap between adjustment of measures based on pair-counting and measures based on information theory. We solve the key technical challenge of analytically computing the expected value and variance of generalized IT measures. This allows us to propose adjustments of generalized IT measures, which reduce to well known adjusted clustering comparison measures as special cases. Using the theory of generalized IT measures, we are able to propose the following guidelines for using ARI and AMI as external validation indices: ARI should be used when the reference clustering has large equal sized clusters; AMI should be used when the reference clustering is unbalanced and there exist small clusters.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1512.01286

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Add feedback

Clustering is Efficient for Approximate Maximum Inner Product Search

Auvolat, Alex, Chandar, Sarath, Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua

arXiv.org Machine LearningNov-29-2015

Efficient Maximum Inner Product Search (MIPS) is an important task that has a wide applicability in recommendation systems and classification with a large number of classes. Solutions based on locality-sensitive hashing (LSH) as well as tree-based solutions have been investigated in the recent literature, to perform approximate MIPS in sublinear time. In this paper, we compare these to another extremely simple approach for solving approximate MIPS, based on variants of the k-means clustering algorithm. Specifically, we propose to train a spherical k-means, after having reduced the MIPS problem to a Maximum Cosine Similarity Search (MCSS). Experiments on two standard recommendation system benchmarks as well as on large vocabulary word embeddings, show that this simple approach yields much higher speedups, for the same retrieval precision, than current state-of-the-art hashing-based and tree-based methods. This simple method also yields more robust retrievals when the query is corrupted by noise.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

1507.0591

Country: North America > Canada > Quebec (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.87)

Add feedback