AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Cats & Co: Categorical Time Series Coclustering

Gay, Dominique, Guigourès, Romain, Boullé, Marc, Clérot, Fabrice

arXiv.org Machine LearningMay-6-2015

We suggest a novel method of clustering and exploratory analysis of temporal event sequences data (also known as categorical time series) based on three-dimensional data grid models. A data set of temporal event sequences can be represented as a data set of three-dimensional points, each point is defined by three variables: a sequence identifier, a time value and an event value. Instantiating data grid models to the 3D-points turns the problem into 3D-coclustering. The sequences are partitioned into clusters, the time variable is discretized into intervals and the events are partitioned into clusters. The cross-product of the univariate partitions forms a multivariate partition of the representation space, i.e., a grid of cells and it also represents a nonparametric estimator of the joint distribution of the sequences, time and events dimensions. Thus, the sequences are grouped together because they have similar joint distribution of time and events, i.e., similar distribution of events along the time dimension. The best data grid is computed using a parameter-free Bayesian model selection approach. We also suggest several criteria for exploiting the resulting grid through agglomerative hierarchies, for interpreting the clusters of sequences and characterizing their components through insightful visualizations. Extensive experiments on both synthetic and real-world data sets demonstrate that data grid models are efficient, effective and discover meaningful underlying patterns of categorical time series data.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

1505.013

Country:

North America > United States (1.00)
Europe (0.67)

Genre: Research Report (0.84)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Data Science > Data Mining (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.87)

Add feedback

Self-Expressive Decompositions for Matrix Approximation and Clustering

Dyer, Eva L., Goldstein, Tom A., Patel, Raajen, Kording, Konrad P., Baraniuk, Richard G.

arXiv.org Machine LearningMay-4-2015

Data-aware methods for dimensionality reduction and matrix decomposition aim to find low-dimensional structure in a collection of data. Classical approaches discover such structure by learning a basis that can efficiently express the collection. Recently, "self expression", the idea of using a small subset of data vectors to represent the full collection, has been developed as an alternative to learning. Here, we introduce a scalable method for computing sparse SElf-Expressive Decompositions (SEED). SEED is a greedy method that constructs a basis by sequentially selecting incoherent vectors from the dataset. After forming a basis from a subset of vectors in the dataset, SEED then computes a sparse representation of the dataset with respect to this basis. We develop sufficient conditions under which SEED exactly represents low rank matrices and vectors sampled from a unions of independent subspaces. We show how SEED can be used in applications ranging from matrix approximation and denoising to clustering, and apply it to numerous real-world datasets. Our results demonstrate that SEED is an attractive low-complexity alternative to other sparse matrix factorization approaches such as sparse PCA and self-expressive methods for clustering.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1505.00824

Country: North America > United States > Maryland (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Risk Bounds For Mode Clustering

Azizyan, Martin, Chen, Yen-Chi, Singh, Aarti, Wasserman, Larry

arXiv.org Machine LearningMay-3-2015

Density mode clustering is a nonparametric clustering method. The clusters are the basins of attraction of the modes of a density estimator. We study the risk of mode-based clustering. We show that the clustering risk over the cluster cores --- the regions where the density is high --- is very small even in high dimensions. And under a low noise condition, the overall cluster risk is small even beyond the cores, in high dimensions.

boundary, critical point, dest, (13 more...)

arXiv.org Machine Learning

1505.00482

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.15)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
Europe > Netherlands > South Holland > Dordrecht (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

Kernel Spectral Clustering and applications

Langone, Rocco, Mall, Raghvendra, Alzate, Carlos, Suykens, Johan A. K.

arXiv.org Machine LearningMay-3-2015

In this chapter we review the main literature related to kernel spectral clustering (KSC), an approach to clustering cast within a kernel-based optimization setting. KSC represents a least-squares support vector machine based formulation of spectral clustering described by a weighted kernel PCA objective. Just as in the classifier case, the binary clustering model is expressed by a hyperplane in a high dimensional space induced by a kernel. In addition, the multi-way clustering can be obtained by combining a set of binary decision functions via an Error Correcting Output Codes (ECOC) encoding scheme. Because of its model-based nature, the KSC method encompasses three main steps: training, validation, testing. In the validation stage model selection is performed to obtain tuning parameters, like the number of clusters present in the data. This is a major advantage compared to classical spectral clustering where the determination of the clustering parameters is unclear and relies on heuristics. Once a KSC model is trained on a small subset of the entire data, it is able to generalize well to unseen test points. Beyond the basic formulation, sparse KSC algorithms based on the Incomplete Cholesky Decomposition (ICD) and $L_0$, $L_1, L_0 + L_1$, Group Lasso regularization are reviewed. In that respect, we show how it is possible to handle large scale data. Also, two possible ways to perform hierarchical clustering and a soft clustering method are presented. Finally, real-world applications such as image segmentation, power load time-series clustering, document clustering and big data learning are considered.

artificial intelligence, kernel spectral clustering, machine learning, (14 more...)

arXiv.org Machine Learning

1505.00477

Country: North America > United States (0.67)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Topic Extraction and Bundling of Related Scientific Articles

Parambath, Shameem A Puthiya

arXiv.org Machine LearningMay-1-2015

Automatic classification of scientific articles based on common characteristics is an interesting problem with many applications in digital library and information retrieval systems. Properly organized articles can be useful for automatic generation of taxonomies in scientific writings, textual summarization, efficient information retrieval etc. Generating article bundles from a large number of input articles, based on the associated features of the articles is tedious and computationally expensive task. In this report we propose an automatic two-step approach for topic extraction and bundling of related articles from a set of scientific articles in real-time. For topic extraction, we make use of Latent Dirichlet Allocation (LDA) topic modeling techniques and for bundling, we make use of hierarchical agglomerative clustering techniques. We run experiments to validate our bundling semantics and compare it with existing models in use. We make use of an online crowdsourcing marketplace provided by Amazon called Amazon Mechanical Turk to carry out experiments. We explain our experimental setup and empirical results in detail and show that our method is advantageous over existing ones.

information retrieval, machine learning, natural language, (16 more...)

arXiv.org Machine Learning

1212.5423

Country: Europe > Sweden (0.29)

Genre: Questionnaire & Opinion Survey (0.97)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Add feedback

Nonparametric Nearest Neighbor Random Process Clustering

Tschannen, Michael, Bölcskei, Helmut

arXiv.org Machine LearningApr-20-2015

We consider the problem of clustering noisy finite-length observations of stationary ergodic random processes according to their nonparametric generative models without prior knowledge of the model statistics and the number of generative models. Two algorithms, both using the L1-distance between estimated power spectral densities (PSDs) as a measure of dissimilarity, are analyzed. The first algorithm, termed nearest neighbor process clustering (NNPC), to the best of our knowledge, is new and relies on partitioning the nearest neighbor graph of the observations via spectral clustering. The second algorithm, simply referred to as k-means (KM), consists of a single k-means iteration with farthest point initialization and was considered before in the literature, albeit with a different measure of dissimilarity and with asymptotic performance results only. We show that both NNPC and KM succeed with high probability under noise and even when the generative process PSDs overlap significantly, all provided that the observation length is sufficiently large. Our results quantify the tradeoff between the overlap of the generative process PSDs, the noise variance, and the observation length. Finally, we present numerical performance results for synthetic and real data.

artificial intelligence, generative model, machine learning, (16 more...)

arXiv.org Machine Learning

doi: 10.1109/ISIT.2015.7282647

1504.05059

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Probabilistic Clustering of Time-Evolving Distance Data

Vogt, Julia E., Kloft, Marius, Stark, Stefan, Raman, Sudhir S., Prabhakaran, Sandhya, Roth, Volker, Rätsch, Gunnar

arXiv.org Machine LearningApr-14-2015

We present a novel probabilistic clustering model for objects that are represented via pairwise distances and observed at different time points. The proposed method utilizes the information given by adjacent time points to find the underlying cluster structure and obtain a smooth cluster evolution. This approach allows the number of objects and clusters to differ at every time point, and no identification on the identities of the objects is needed. Further, the model does not require the number of clusters being specified in advance -- they are instead determined automatically using a Dirichlet process prior. We validate our model on synthetic data showing that the proposed method is more accurate than state-of-the-art clustering methods. Finally, we use our dynamic clustering model to analyze and illustrate the evolution of brain cancer patients over time.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

1504.03701

Country:

North America > United States (0.93)
Europe (0.68)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Oncology (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Relax, no need to round: integrality of clustering formulations

Awasthi, Pranjal, Bandeira, Afonso S., Charikar, Moses, Krishnaswamy, Ravishankar, Villar, Soledad, Ward, Rachel

arXiv.org Machine LearningApr-14-2015

We study exact recovery conditions for convex relaxations of point cloud clustering problems, focusing on two of the most common optimization problems for unsupervised clustering: $k$-means and $k$-median clustering. Motivations for focusing on convex relaxations are: (a) they come with a certificate of optimality, and (b) they are generic tools which are relatively parameter-free, not tailored to specific assumptions over the input. More precisely, we consider the distributional setting where there are $k$ clusters in $\mathbb{R}^m$ and data from each cluster consists of $n$ points sampled from a symmetric distribution within a ball of unit radius. We ask: what is the minimal separation distance between cluster centers needed for convex relaxations to exactly recover these $k$ clusters as the optimal integral solution? For the $k$-median linear programming relaxation we show a tight bound: exact recovery is obtained given arbitrarily small pairwise separation $\epsilon > 0$ between the balls. In other words, the pairwise center separation is $\Delta > 2+\epsilon$. Under the same distributional model, the $k$-means LP relaxation fails to recover such clusters at separation as large as $\Delta = 4$. Yet, if we enforce PSD constraints on the $k$-means LP, we get exact cluster recovery at center separation $\Delta > 2\sqrt2(1+\sqrt{1/m})$. In contrast, common heuristics such as Lloyd's algorithm (a.k.a. the $k$-means algorithm) can fail to recover clusters in this setting; even with arbitrarily large cluster separation, k-means++ with overseeding by any constant factor fails with high probability at exact cluster recovery. To complement the theoretical analysis, we provide an experimental study of the recovery guarantees for these various methods, and discuss several open problems which these experiments suggest.

artificial intelligence, machine learning, relaxation, (18 more...)

arXiv.org Machine Learning

1408.4045

Country: North America > United States (0.14)

Genre:

Research Report > New Finding (0.87)
Research Report > Experimental Study (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

The geometry of kernelized spectral clustering

Schiebinger, Geoffrey, Wainwright, Martin J., Yu, Bin

arXiv.org Machine LearningApr-7-2015

Clustering of data sets is a standard problem in many areas of science and engineering. The method of spectral clustering is based on embedding the data set using a kernel function, and using the top eigenvectors of the normalized Laplacian to recover the connected components. We study the performance of spectral clustering in recovering the latent labels of i.i.d. samples from a finite mixture of nonparametric distributions. The difficulty of this label recovery problem depends on the overlap between mixture components and how easily a mixture component is divided into two nonoverlapping components. When the overlap is small compared to the indivisibility of the mixture components, the principal eigenspace of the population-level normalized Laplacian operator is approximately spanned by the square-root kernelized component densities. In the finite sample setting, and under the same assumption, embedded samples from different components are approximately orthogonal with high probability when the sample size is large. As a corollary we control the fraction of samples mislabeled by spectral clustering under finite mixtures with nonparametric components.

artificial intelligence, machine learning, mixture component, (16 more...)

arXiv.org Machine Learning

doi: 10.1214/14-AOS1283

1404.7552

Country: North America > United States > California (0.28)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Efficient Dictionary Learning via Very Sparse Random Projections

Pourkamali-Anaraki, Farhad, Becker, Stephen, Hughes, Shannon M.

arXiv.org Machine LearningApr-5-2015

Performing signal processing tasks on compressive measurements of data has received great attention in recent years. In this paper, we extend previous work on compressive dictionary learning by showing that more general random projections may be used, including sparse ones. More precisely, we examine compressive K-means clustering as a special case of compressive dictionary learning and give theoretical guarantees for its performance for a very general class of random projections. We then propose a memory and computation efficient dictionary learning algorithm, specifically designed for analyzing large volumes of high-dimensional data, which learns the dictionary from very sparse random projections. Experimental results demonstrate that our approach allows for reduction of computational complexity and memory/data access, with controllable loss in accuracy.

artificial intelligence, compressive measurement, machine learning, (15 more...)

arXiv.org Machine Learning

1504.01169

Country: North America > United States > Colorado > Boulder County > Boulder (0.14)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

Add feedback