AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Subspace Clustering via Thresholding and Spectral Clustering

Heckel, Reinhard, Bölcskei, Helmut

arXiv.org Machine LearningMar-15-2013

We consider the problem of clustering a set of high-dimensional data points into sets of low-dimensional linear subspaces. The number of subspaces, their dimensions, and their orientations are unknown. We propose a simple and low-complexity clustering algorithm based on thresholding the correlations between the data points followed by spectral clustering. A probabilistic performance analysis shows that this algorithm succeeds even when the subspaces intersect, and when the dimensions of the subspaces scale (up to a log-factor) linearly in the ambient dimension. Moreover, we prove that the algorithm also succeeds for data points that are subject to erasures with the number of erasures scaling (up to a log-factor) linearly in the ambient dimension. Finally, we propose a simple scheme that provably detects outliers.

artificial intelligence, machine learning, subspace, (12 more...)

arXiv.org Machine Learning

1303.3716

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

Add feedback

Improved Performance of Unsupervised Method by Renovated K-Means

Ashok, P., Nawaz, G. M Kadhar, Elayaraja, E., Vadivel, V.

arXiv.org Machine LearningMar-11-2013

Clustering is a separation of data into groups of similar objects. Every group called cluster consists of objects that are similar to one another and dissimilar to objects of other groups. In this paper, the K-Means algorithm is implemented by three distance functions and to identify the optimal distance function for clustering methods. The proposed K-Means algorithm is compared with K-Means, Static Weighted K-Means (SWK-Means) and Dynamic Weighted K-Means (DWK-Means) algorithm by using Davis Bouldin index, Execution Time and Iteration count methods. Experimental results show that the proposed K-Means algorithm performed better on Iris and Wine dataset when compared with other three clustering methods.

algorithm, artificial intelligence, machine learning, (19 more...)

arXiv.org Machine Learning

1304.0725

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Clustering on Multi-Layer Graphs via Subspace Analysis on Grassmann Manifolds

Dong, Xiaowen, Frossard, Pascal, Vandergheynst, Pierre, Nefedov, Nikolai

arXiv.org Machine LearningMar-9-2013

Relationships between entities in datasets are often of multiple nature, like geographical distance, social relationships, or common interests among people in a social network, for example. This information can naturally be modeled by a set of weighted and undirected graphs that form a global multilayer graph, where the common vertex set represents the entities and the edges on different layers capture the similarities of the entities in term of the different modalities. In this paper, we address the problem of analyzing multi-layer graphs and propose methods for clustering the vertices by efficiently merging the information provided by the multiple modalities. To this end, we propose to combine the characteristics of individual graph layers using tools from subspace analysis on a Grassmann manifold. The resulting combination can then be viewed as a low dimensional representation of the original data which preserves the most important information from diverse relationships between entities. We use this information in new clustering methods and test our algorithm on several synthetic and real world datasets where we demonstrate superior or competitive performances compared to baseline and state-of-the-art techniques. Our generic framework further extends to numerous analysis and learning problems that involve different types of information on graphs.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

doi: 10.1109/TSP.2013.2295553

1303.2221

Country:

North America (0.46)
Europe > Switzerland (0.28)

Genre: Research Report > Promising Solution (0.34)

Industry:

Education (0.34)
Information Technology (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Large-Margin Metric Learning for Partitioning Problems

Lajugie, Rémi, Arlot, Sylvain, Bach, Francis

arXiv.org Machine LearningMar-6-2013

In this paper, we consider unsupervised partitioning problems, such as clustering, image segmentation, video segmentation and other change-point detection problems. We focus on partitioning problems based explicitly or implicitly on the minimization of Euclidean distortions, which include mean-based change-point detection, K-means, spectral clustering and normalized cuts. Our main goal is to learn a Mahalanobis metric for these unsupervised problems, leading to feature weighting and/or selection. This is done in a supervised way by assuming the availability of several potentially partially labelled datasets that share the same metric. We cast the metric learning problem as a large-margin structured prediction problem, with proper definition of regularizers and losses, leading to a convex optimization problem which can be solved efficiently with iterative techniques. We provide experiments where we show how learning the metric may significantly improve the partitioning performance in synthetic examples, bioinformatics, video segmentation and image segmentation problems.

artificial intelligence, machine learning, partition, (14 more...)

arXiv.org Machine Learning

1303.128

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (0.46)
Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

Add feedback

Bayesian Consensus Clustering

Lock, Eric F., Dunson, David B.

arXiv.org Machine LearningFeb-28-2013

The task of clustering a set of objects based on multiple sources of data arises in several modern applications. We propose an integrative statistical model that permits a separate clustering of the objects for each data source. These separate clusterings adhere loosely to an overall consensus clustering, and hence they are not independent. We describe a computationally scalable Bayesian framework for simultaneous estimation of both the consensus clustering and the source-specific clusterings. We demonstrate that this flexible approach is more robust than joint clustering of all data sources, and is more powerful than clustering each data source separately. This work is motivated by the integrated analysis of heterogeneous biomedical data, and we present an application to subtype identification of breast cancer tumor samples using publicly available data from The Cancer Genome Atlas. Several fields of research now analyze multi-source data (also called multimodal data), in which multiple heterogeneous datasets describe a common set of objects.

artificial intelligence, data source, machine learning, (20 more...)

arXiv.org Machine Learning

doi: 10.1093/bioinformatics/btt425

1302.728

Country:

North America > United States (1.00)
Europe (0.93)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)

Add feedback

Metrics for Multivariate Dictionaries

Chevallier, Sylvain, Barthélemy, Quentin, Atif, Jamal

arXiv.org Machine LearningFeb-26-2013

Overcomplete representations and dictionary learning algorithms kept attracting a growing interest in the machine learning community. This paper addresses the emerging problem of comparing multivariate overcomplete representations. Despite a recurrent need to rely on a distance for learning or assessing multivariate overcomplete representations, no metrics in their underlying spaces have yet been proposed. Henceforth we propose to study overcomplete representations from the perspective of frame theory and matrix manifolds. We consider distances between multivariate dictionaries as distances between their spans which reveal to be elements of a Grassmannian manifold. We introduce Wasserstein-like set-metrics defined on Grassmannian spaces and study their properties both theoretically and numerically. Indeed a deep experimental study based on tailored synthetic datasetsand real EEG signals for Brain-Computer Interfaces (BCI) have been conducted. In particular, the introduced metrics have been embedded in clustering algorithm and applied to BCI Competition IV-2a for dataset quality assessment. Besides, a principled connection is made between three close but still disjoint research fields, namely, Grassmannian packing, dictionary learning and compressed sensing.

algorithm, dataset, dictionary learning algorithm, (13 more...)

arXiv.org Machine Learning

doi: 10.1109/ICASSP.2014.6854993

1302.4242

Country:

Europe > France (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

A Conformal Prediction Approach to Explore Functional Data

Lei, Jing, Rinaldo, Alessandro, Wasserman, Larry

arXiv.org Machine LearningFeb-26-2013

This paper applies conformal prediction techniques to compute simultaneous prediction bands and clustering trees for functional data. These tools can be used to detect outliers and clusters. Both our prediction bands and clustering trees provide prediction sets for the underlying stochastic process with a guaranteed finite sample behavior, under no distributional assumptions. The prediction sets are also informative in that they correspond to the high density region of the underlying process. While ordinary conformal prediction has high computational cost for functional data, we use the inductive conformal predictor, together with several novel choices of conformity scores, to simplify the computation. Our methods are illustrated on some real data examples.

artificial intelligence, functional data, machine learning, (15 more...)

arXiv.org Machine Learning

1302.6452

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Generating Extractive Summaries of Scientific Paradigms

Qazvinian, V., Radev, D. R., Mohammad, S. M., Dorr, B., Zajic, D., Whidby, M., Moon, T.

Journal of Artificial Intelligence ResearchFeb-20-2013

Researchers and scientists increasingly find themselves in the position of having to quickly understand large amounts of technical material. Our goal is to effectively serve this need by using bibliometric text mining and summarization techniques to generate summaries of scientific literature. We show how we can use citations to produce automatically generated, readily consumable, technical extractive summaries. We first propose C-LexRank, a model for summarizing single scientific articles based on citations, which employs community detection and extracts salient information-rich sentences. Next, we further extend our experiments to summarize a set of papers, which cover the same scientific topic. We generate extractive summaries of a set of Question Answering (QA) and Dependency Parsing (DP) papers, their abstracts, and their citation sentences and show that citations have unique information amenable to creating a summary.

citation sentence, factoid, proceedings, (13 more...)

Journal of Artificial Intelligence Research

doi: 10.1613/jair.3732

AI Access Foundation

10800

Journal of Artificial Intelligence Research

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
North America > United States > Maryland > Prince George's County > College Park (0.14)
(21 more...)

Genre:

Research Report > New Finding (1.00)
Overview (0.68)

Industry: Government > Regional Government (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Spectral Clustering with Unbalanced Data

Qian, Jing, Saligrama, Venkatesh

arXiv.org Machine LearningFeb-20-2013

Spectral clustering (SC) and graph-based semi-supervised learning (SSL) algorithms are sensitive to how graphs are constructed from data. In particular if the data has proximal and unbalanced clusters these algorithms can lead to poor performance on well-known graphs such as $k$-NN, full-RBF, $\epsilon$-graphs. This is because the objectives such as Ratio-Cut (RCut) or normalized cut (NCut) attempt to tradeoff cut values with cluster sizes, which are not tailored to unbalanced data. We propose a novel graph partitioning framework, which parameterizes a family of graphs by adaptively modulating node degrees in a $k$-NN graph. We then propose a model selection scheme to choose sizable clusters which are separated by smallest cut values. Our framework is able to adapt to varying levels of unbalancedness of data and can be naturally used for small cluster detection. We theoretically justify our ideas through limit cut analysis. Unsupervised and semi-supervised experiments on synthetic and real data sets demonstrate the superiority of our method.

artificial intelligence, graph, machine learning, (17 more...)

arXiv.org Machine Learning

1302.5134

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Add feedback

Breaking the Small Cluster Barrier of Graph Clustering

Ailon, Nir, Chen, Yudong, Huan, Xu

arXiv.org Machine LearningFeb-20-2013

This paper considers a classic problem in machine learning and theoretical computer science, namely graph clustering, i.e., given an undirected unweighted graph, partition the nodes into disjoint clusters, so that the density of edges within one cluster is higher than those across clusters. Graph clustering arises naturally in many application across science and engineering. Some prominent examples include community detection in social network Mishra et al. [2007], submarket identification in E-commerce and sponsored search Yahoo!-Inc [2009], and co-authorship analysis in analyzing document database Ester et al. [1995], among others. From a purely binary classification theoretical point of view, the edges of the graph are (noisy) labels of similarity or affinity between pairs of objects, and the concept class consists of clusterings of the objects (encoded graphically by identifying clusters with cliques). Many theoretical results in graph clustering [e.g., Boppana, 1987, Chen et al., 2012, McSherry, 2001] consider the planted partition model, in which the edges are generated randomly; see Section 1.1 for more details. While numerous different methods have been proposed, their performance guarantees all share the following manner - under certain condition of the density of edges (within clusters and across clusters), the proposed method succeeds to recover the correct clusters exactly if all clusters are larger than a threshold size, typically Ω( n).

artificial intelligence, machine learning, probability, (19 more...)

arXiv.org Machine Learning

1302.4549

Country: Asia (0.28)

Genre: Research Report (1.00)

Industry: Information Technology > Services (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback