AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Statistical Inference Using Mean Shift Denoising

Xiang, Yunhua, Chen, Yen-Chi

arXiv.org Machine Learning | Oct-12-2016

In this paper, we study how the mean shift algorithm can be used to denoise a dataset. We introduce a new framework to analyze the mean shift algorithm as a denoising approach by viewing the algorithm as an operator on a distribution function. We investigate how the mean shift algorithm changes the distribution and show that data points shifted by the mean shift concentrate around high density regions of the underlying density function. By using the mean shift as a denoising method, we enhance the performance of several clustering techniques, improve the power of two-sample tests, and obtain a new method for anomaly detection.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Machine Learning

1610.03927

Country: Europe (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
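
The denoising idea in the abstract above can be illustrated with a minimal sketch. It assumes a one-dimensional dataset, a Gaussian kernel, and a fixed bandwidth; the function names `mean_shift_step` and `mean_shift_denoise` are hypothetical and this is not the authors' implementation. Each point is repeatedly moved to the kernel-weighted mean of the original data, so shifted points drift toward nearby high-density regions.

```python
import math
import random

def mean_shift_step(points, data, bandwidth):
    """Move each point to the Gaussian-kernel weighted mean of the data."""
    shifted = []
    for p in points:
        weights = [math.exp(-((p - x) ** 2) / (2 * bandwidth ** 2)) for x in data]
        total = sum(weights)
        shifted.append(sum(w * x for w, x in zip(weights, data)) / total)
    return shifted

def mean_shift_denoise(data, bandwidth=0.5, n_steps=5):
    """Apply a few mean shift steps to every point; the data stay fixed."""
    points = list(data)
    for _ in range(n_steps):
        points = mean_shift_step(points, data, bandwidth)
    return points

random.seed(0)
# Two 1-D clusters plus uniform background noise (illustrative data only).
data = ([random.gauss(-2.0, 0.3) for _ in range(50)]
        + [random.gauss(2.0, 0.3) for _ in range(50)]
        + [random.uniform(-4.0, 4.0) for _ in range(20)])
denoised = mean_shift_denoise(data)
print(denoised[:3])
```

The bandwidth and the number of steps are assumed values; in practice both would need to be chosen for the data at hand, and the denoised points could then be passed to any clustering method.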

Condorcet's Jury Theorem for Consensus Clustering and its Implications for Diversity

Jain, Brijnesh J.

arXiv.org Machine Learning | Oct-10-2016

Condorcet's Jury Theorem has been invoked for ensemble classifiers to indicate that the combination of many classifiers can have better predictive performance than a single classifier. Such a theoretical underpinning is unknown for consensus clustering. This article extends Condorcet's Jury Theorem to the mean partition approach under the additional assumptions that a unique ground-truth partition exists and sample partitions are drawn from a sufficiently small ball containing the ground-truth. As an implication of practical relevance, we question the claim that the quality of consensus clustering depends on the diversity of the sample partitions. Instead, we conjecture that limiting the diversity of the mean partitions is necessary for controlling the quality.

artificial intelligence, machine learning, partition, (18 more...)

arXiv.org Machine Learning

1604.07711

Country: Europe (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
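
For context, the classical form of Condorcet's Jury Theorem (the binary-vote version used for ensemble classifiers, not the mean partition extension described above) can be checked numerically. The sketch below assumes independent voters who are each correct with the same probability; the names `majority_correct_prob`, `n_voters`, and `p_correct` are hypothetical.

```python
from math import comb

def majority_correct_prob(n_voters, p_correct):
    """Probability that a strict majority of independent voters is correct."""
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

for n in (1, 3, 11, 101):
    print(n, majority_correct_prob(n, 0.6))
```

When each voter is better than chance, the majority becomes more reliable as voters are added; when each voter is worse than chance, adding voters makes the majority less reliable.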

Phase transitions and optimal algorithms in high-dimensional Gaussian mixture clustering

Lesieur, Thibault, De Bacco, Caterina, Banks, Jess, Krzakala, Florent, Moore, Cris, Zdeborová, Lenka

arXiv.org Machine Learning | Oct-10-2016

We consider the problem of Gaussian mixture clustering in the high-dimensional limit where the data consists of $m$ points in $n$ dimensions, $n, m \to \infty$ and $\alpha = m/n$ stays finite. Using exact but non-rigorous methods from statistical physics, we determine the critical value of $\alpha$ and the distance between the clusters at which it becomes information-theoretically possible to reconstruct the membership into clusters better than chance. We also determine the accuracy achievable by the Bayes-optimal estimation algorithm. In particular, we find that when the number of clusters is sufficiently large, $r > 4 + 2\sqrt{\alpha}$, there is a gap between the threshold for information-theoretically optimal performance and the threshold at which known algorithms succeed. Clustering $m$ points in $n$-dimensional space is a ubiquitous problem in statistical inference and data science.

artificial intelligence, bayesian inference, machine learning, (19 more...)

arXiv.org Machine Learning

doi: 10.1109/ALLERTON.2016.7852287

1610.02918

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.50)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)

A new selection strategy for selective cluster ensemble based on Diversity and Independency

Yousefnezhad, Muhammad, Reihanian, Ali, Zhang, Daoqiang, Minaei-Bidgoli, Behrouz

arXiv.org Machine Learning | Oct-9-2016

This research introduces a new strategy in cluster ensemble selection by using Independency and Diversity metrics. In recent years, Diversity and Quality, which are two metrics in the evaluation procedure, have been used for selecting basic clustering results in cluster ensemble selection. Although quality can improve the final results in a cluster ensemble, it cannot control the procedures of generating basic results, which causes a gap in predicting the accuracy of the generated basic results. Instead of quality, this paper introduces Independency as a supplementary method to be used in conjunction with Diversity. Therefore, this paper uses a heuristic metric, which is based on the procedure of converting code to graph in Software Testing, in order to calculate the Independency of two basic clustering algorithms. Moreover, a new modeling language, which we call the "Clustering Algorithms Independency Language" (CAIL), is introduced in order to generate graphs which depict the Independency of algorithms. Also, Uniformity, a new similarity metric, is introduced for evaluating the diversity of basic results. Our experimental results on a variety of standard data sets show that the proposed framework improves the accuracy of final results dramatically in comparison with other cluster ensemble methods.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

doi: 10.1016/j.engappai.2016.10.005

1610.02649

Country: Asia (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Combining local and global smoothing in multivariate density estimation

Azzalini, Adelchi

arXiv.org Machine Learning | Oct-7-2016

Nonparametric estimation of a multivariate density is tackled via a method which combines traditional local smoothing with a form of global smoothing but without imposing a rigid structure. Simulation work delivers encouraging indications on the effectiveness of the method. An application to density-based clustering illustrates a possible usage. Consider estimation of the probability density function f(·) of a continuous random variable in cases when a parametric formulation for f is not considered appropriate. Given a random sample drawn from f, a variety of nonparametric estimation methods are available.

artificial intelligence, machine learning, wkde 0, (17 more...)

arXiv.org Machine Learning

1610.02372

Country: Europe (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Non-Parametric Cluster Significance Testing with Reference to a Unimodal Null Distribution

Helgeson, Erika S., Bair, Eric

arXiv.org Machine Learning | Oct-5-2016

Cluster analysis is an unsupervised learning strategy that can be employed to identify subgroups of observations in data sets of unknown structure. This strategy is particularly useful for analyzing high-dimensional data such as microarray gene expression data. Many clustering methods are available, but it is challenging to determine if the identified clusters represent distinct subgroups. We propose a novel strategy to investigate the significance of identified clusters by comparing the within-cluster sum of squares from the original data to that produced by clustering an appropriate unimodal null distribution. The null distribution we present for this problem uses kernel density estimation and thus does not require that the data follow any particular distribution. We find that our method can accurately test for the presence of clustering even when the number of features is high.

artificial intelligence, cluster index, machine learning, (12 more...)

arXiv.org Machine Learning

1610.01424

Genre: Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.90)

Hierarchical Clustering in R

#artificialintelligence | Oct-1-2016, 22:45:41 GMT

In this post, I will show you how to do hierarchical clustering in R. We will use the iris dataset again, as we did for k-means clustering. If you recall from the post about k-means clustering, it requires us to specify the number of clusters, and finding the optimal number of clusters can often be hard. Hierarchical clustering is an alternative approach which builds a hierarchy from the bottom up and doesn't require us to specify the number of clusters beforehand. Once this is done, the result is usually represented by a dendrogram-like structure. Complete linkage and mean linkage are the linkage methods used most often.

artificial intelligence, hierarchical clustering, machine learning, (5 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
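
The bottom-up procedure described in the post above can be sketched in a few lines. The post itself works in R on the iris dataset; the sketch below is an assumption-laden illustration in plain Python on six made-up 2-D points, using complete linkage, and the names `complete_linkage` and `agglomerate` are hypothetical.

```python
import math

def complete_linkage(a, b, points):
    """Cluster distance = largest pairwise distance between their members."""
    return max(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerate(points, n_clusters):
    """Start from singletons and repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # (members_a, members_b, merge height): the dendrogram record
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = complete_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters, merges

# Illustrative data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
clusters, merges = agglomerate(points, n_clusters=2)
print(clusters)
```

Calling `agglomerate(points, n_clusters=1)` records the full merge history, so the number of clusters only has to be chosen afterwards, when the hierarchy is cut at some height.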

Fast learning rates with heavy-tailed losses

Dinh, Vu, Ho, Lam Si Tung, Nguyen, Duy, Nguyen, Binh T.

arXiv.org Machine Learning | Sep-29-2016

We study fast learning rates when the losses are not necessarily bounded and may have a distribution with heavy tails. To enable such analyses, we introduce two new conditions: (i) the envelope function $\sup_{f \in \mathcal{F}}|\ell \circ f|$, where $\ell$ is the loss function and $\mathcal{F}$ is the hypothesis class, exists and is $L^r$-integrable, and (ii) $\ell$ satisfies the multi-scale Bernstein's condition on $\mathcal{F}$. Under these assumptions, we prove that learning rates faster than $O(n^{-1/2})$ can be obtained and, depending on $r$ and the multi-scale Bernstein's powers, can be arbitrarily close to $O(n^{-1})$. We then verify these assumptions and derive fast learning rates for the problem of vector quantization by $k$-means clustering with heavy-tailed distributions. The analyses enable us to obtain novel learning rates that extend and complement existing results in the literature from both theoretical and practical viewpoints.

artificial intelligence, bernstein, machine learning, (19 more...)

arXiv.org Machine Learning

1609.09481

Country: North America > United States > California (0.28)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)
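
The quantity studied in the vector quantization application above is the empirical $k$-means risk, the mean squared distance from each sample to its nearest center. The sketch below is only an illustration under stated assumptions: Lloyd's algorithm is used as the optimizer (the abstract does not name one), the data are a one-dimensional signed Pareto sample chosen here as an example of a heavy-tailed distribution, and the names `nearest`, `distortion`, and `lloyd` are hypothetical.

```python
import random

def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)

def distortion(data, centers):
    """Empirical quantization risk: mean squared distance to nearest center."""
    return sum((x - centers[nearest(x, centers)]) ** 2 for x in data) / len(data)

def lloyd(data, centers, n_iter=20):
    """Alternate assignment and mean-update steps, tracking the risk."""
    history = [distortion(data, centers)]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for x in data:
            groups[nearest(x, centers)].append(x)
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        history.append(distortion(data, centers))
    return centers, history

random.seed(1)
# Signed Pareto sample with shape 3.0: heavy tails, finite variance (assumed).
data = [random.choice((-1, 1)) * random.paretovariate(3.0) for _ in range(500)]
centers, history = lloyd(data, random.sample(data, 3))
print(history[0], history[-1])
```

The recorded risk never increases from one iteration to the next, which is a property of Lloyd's algorithm itself and says nothing about the learning rates derived in the paper.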

Minimum Density Hyperplanes

Pavlidis, Nicos G., Hofmeyr, David P., Tasoulis, Sotiris K.

arXiv.org Machine Learning | Sep-28-2016

Associating distinct groups of objects (clusters) with contiguous regions of high probability density (high-density clusters) is central to many statistical and machine learning approaches to the classification of unlabelled data. We propose a novel hyperplane classifier for clustering and semi-supervised classification which is motivated by this objective. The proposed minimum density hyperplane minimises the integral of the empirical probability density function along it, thereby avoiding intersection with high-density clusters. We show that the minimum density and the maximum margin hyperplanes are asymptotically equivalent, thus linking this approach to maximum margin clustering and semi-supervised support vector classifiers. We propose a projection pursuit formulation of the associated optimisation problem which allows us to find minimum density hyperplanes efficiently in practice, and evaluate its performance on a range of benchmark data sets. The proposed approach is found to be very competitive with state-of-the-art methods for clustering and semi-supervised classification.

artificial intelligence, hyperplane, machine learning, (16 more...)

arXiv.org Machine Learning

1507.04201

Country: North America > United States (0.28)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.66)

StruClus: Structural Clustering of Large-Scale Graph Databases

Schäfer, Till, Mutzel, Petra

arXiv.org Machine Learning | Sep-28-2016

We present a structural clustering algorithm for large-scale datasets of small labeled graphs, utilizing a frequent subgraph sampling strategy. A set of representatives provides an intuitive description of each cluster, supports the clustering process, and helps to interpret the clustering results. The projection-based nature of the clustering approach allows us to bypass dimensionality and feature extraction problems that arise in the context of graph datasets reduced to pairwise distances or feature vectors. While achieving high-quality and human-interpretable clusterings, the runtime of the algorithm grows only linearly with the number of graphs. Furthermore, the approach is easy to parallelize and therefore suitable for very large datasets. Our extensive experimental evaluation on synthetic and real-world datasets demonstrates the superiority of our approach over existing structural and subspace clustering algorithms, from both a runtime and a quality point of view.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Machine Learning

1609.09

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
