AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Machine Learning: An In-Depth, Non-Technical Guide -- Part 5 -- InnoArchiTech Innovation -- Data -- Technology -- Leadership

#artificialintelligenceMar-21-2016, 20:10:55 GMT

Originally published at innoarchitech.com here on March 18, 2016. Welcome to the fifth and final chapter in a five-part series about machine learning. In this final chapter, we will revisit unsupervised learning in greater depth, briefly discuss other fields related to machine learning, and finish the series with some examples of real-world machine learning applications. Recall that unsupervised learning involves learning from data, but without the goal of prediction. This is because the data is either not given with a target response variable (label), or one chooses not to designate a response.

artificial intelligence, machine learning, principal component, (12 more...)

#artificialintelligence

Country: North America > United States > Illinois > Cook County > Chicago (0.05)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Machine Learning: An In-Depth, Non-Technical Guide - Part 5

#artificialintelligenceMar-21-2016, 18:26:35 GMT

Welcome to the fifth and final chapter in a five-part series about machine learning. In this final chapter, we will revisit unsupervised learning in greater depth, briefly discuss other fields related to machine learning, and finish the series with some examples of real-world machine learning applications. Recall that unsupervised learning involves learning from data, but without the goal of prediction. This is because the data is either not given with a target response variable (label), or one chooses not to designate a response. It can also be used as a pre-processing step for supervised learning.

artificial intelligence, machine learning, principal component, (12 more...)

#artificialintelligence

Country: North America > United States > Illinois > Cook County > Chicago (0.05)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

K Means Clustering - Effect of random seed

@machinelearnbotMar-21-2016, 13:24:37 GMT

When the k-means clustering algorithm runs, it uses a randomly generated seed to determine the starting centroids of the clusters. However, if the data is evenly distributed, then we might end up with different cluster members based on the initial random variable. An example for such a behavior is shown. R is used for the experiment. The code to load the data and the contents of the data are as follows. We try to group the samples based on two feature variables - age and bmi.

artificial intelligence, machine learning, means clustering, (4 more...)

@machinelearnbot

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.99)

Add feedback

Data Clustering and Graph Partitioning via Simulated Mixing

Bhatti, Shahzad, Beck, Carolyn, Nedic, Angelia

arXiv.org Machine LearningMar-15-2016

Spectral clustering approaches have led to well-accepted algorithms for finding accurate clusters in a given dataset. However, their application to large-scale datasets has been hindered by computational complexity of eigenvalue decompositions. Several algorithms have been proposed in the recent past to accelerate spectral clustering, however they compromise on the accuracy of the spectral clustering to achieve faster speed. In this paper, we propose a novel spectral clustering algorithm based on a mixing process on a graph. Unlike the existing spectral clustering algorithms, our algorithm does not require computing eigenvectors. Specifically, it finds the equivalent of a linear combination of eigenvectors of the normalized similarity matrix weighted with corresponding eigenvalues. This linear combination is then used to partition the dataset into meaningful clusters. Simulations on real datasets show that partitioning datasets based on such linear combinations of eigenvectors achieves better accuracy than standard spectral clustering methods as the number of clusters increase. Our algorithm can easily be implemented in a distributed setting.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1603.04918

Country: North America > United States (0.68)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Efficient Clustering of Correlated Variables and Variable Selection in High-Dimensional Linear Models

Gauraha, Niharika, Parui, Swapan K.

arXiv.org Machine LearningMar-11-2016

In this paper, we introduce Adaptive Cluster Lasso(ACL) method for variable selection in high dimensional sparse regression models with strongly correlated variables. To handle correlated variables, the concept of clustering or grouping variables and then pursuing model fitting is widely accepted. When the dimension is very high, finding an appropriate group structure is as difficult as the original problem. The ACL is a three-stage procedure where, at the first stage, we use the Lasso(or its adaptive or thresholded version) to do initial selection, then we also include those variables which are not selected by the Lasso but are strongly correlated with the variables selected by the Lasso. At the second stage we cluster the variables based on the reduced set of predictors and in the third stage we perform sparse estimation such as Lasso on cluster representatives or the group Lasso based on the structures generated by clustering procedure. We show that our procedure is consistent and efficient in finding true underlying population group structure(under assumption of irrepresentable and beta-min conditions). We also study the group selection consistency of our method and we support the theory using simulated and pseudo-real dataset examples.

artificial intelligence, lasso, machine learning, (18 more...)

arXiv.org Machine Learning

1603.03724

Genre: Research Report (0.64)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)

Add feedback

Bipartite Correlation Clustering -- Maximizing Agreements

Asteris, Megasthenis, Kyrillidis, Anastasios, Papailiopoulos, Dimitris, Dimakis, Alexandros G.

arXiv.org Machine LearningMar-9-2016

In Bipartite Correlation Clustering (BCC) we are given a complete bipartite graph $G$ with `+' and `-' edges, and we seek a vertex clustering that maximizes the number of agreements: the number of all `+' edges within clusters plus all `-' edges cut across clusters. BCC is known to be NP-hard. We present a novel approximation algorithm for $k$-BCC, a variant of BCC with an upper bound $k$ on the number of clusters. Our algorithm outputs a $k$-clustering that provably achieves a number of agreements within a multiplicative ${(1-\delta)}$-factor from the optimal, for any desired accuracy $\delta$. It relies on solving a combinatorially constrained bilinear maximization on the bi-adjacency matrix of $G$. It runs in time exponential in $k$ and $\delta^{-1}$, but linear in the size of the input. Further, we show that, in the (unconstrained) BCC setting, an ${(1-\delta)}$-approximation can be achieved by $O(\delta^{-1})$ clusters regardless of the size of the graph. In turn, our $k$-BCC algorithm implies an Efficient PTAS for the BCC objective of maximizing agreements.

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv.org Machine Learning

1603.02782

Country: North America > United States > Texas (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Add feedback

A Bayesian non-parametric method for clustering high-dimensional binary data

Santra, Tapesh

arXiv.org Machine LearningMar-8-2016

In many real life problems, objects are described by large number of binary features. For instance, documents are characterized by presence or absence of certain keywords; cancer patients are characterized by presence or absence of certain mutations etc. In such cases, grouping together similar objects/profiles based on such high dimensional binary features is desirable, but challenging. Here, I present a Bayesian non parametric algorithm for clustering high dimensional binary data. It uses a Dirichlet Process (DP) mixture model and simulated annealing to not only cluster binary data, but also find optimal number of clusters in the data. The performance of the algorithm was evaluated and compared with other algorithms using simulated datasets. It outperformed all other clustering methods that were tested in the simulation studies. It was also used to cluster real datasets arising from document analysis, handwritten image analysis and cancer research. It successfully divided a set of documents based on their topics, hand written images based on different styles of writing digits and identified tissue and mutation specificity of chemotherapy treatments.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

1603.02494

Country: Europe (0.14)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Oncology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Efficient Multiscale Gaussian Process Regression using Hierarchical Clustering

Zhang, Z., Duraisamy, K., Gumerov, N. A.

arXiv.org Machine LearningMar-6-2016

Machine Learning manuscript No. (will be inserted by the editor) Abstract Standard Gaussian Process (GP) regression, a powerful machine learning tool, is computationally expensive when it is applied to large datasets, and potentially inaccurate when data points are sparsely distributed in a highdimensional feature space. To address these challenges, a new multiscale, sparsified GP algorithm is formulated, with the goal of application to large scientific computing datasets. In this approach, the data is partitioned into clusters and the cluster centers are used to define a reduced training set, resulting in an improvement over standard GPs in terms of training and evaluation costs. Further, a hierarchical technique is used to adaptively map the local covariance representation to the underlying sparsity of the feature space, leading to improved prediction accuracy when the data distribution is highly nonuniform. A theoretical investigation of the computational complexity of the algorithm is presented. The efficacy of this method is then demonstrated on smooth and discontinuous analytical functions and on data from a direct numerical simulation of turbulent combustion. Keywords Gaussian Processes, Sparse regression, Clustering. 1 Introduction The rapid growth in computing power has resulted in the generation of massive amounts of highly-resolved datasets in many fields of science and engineering.

artificial intelligence, machine learning, training point, (14 more...)

arXiv.org Machine Learning

1511.02258

Country: North America > United States > Michigan (0.28)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.83)

Add feedback

Clustering by Hierarchical Nearest Neighbor Descent (H-NND)

Qiu, Teng, Li, Yongjie

arXiv.org Machine LearningMar-4-2016

Previously in 2014, we proposed the Nearest Descent (ND) method, capable of generating an efficient Graph, called the in-tree (IT). Due to some beautiful and effective features, this IT structure proves well suited for data clustering. Although there exist some redundant edges in IT, they usually have salient features and thus it is not hard to remove them. Subsequently, in order to prevent the seemingly redundant edges from occurring, we proposed the Nearest Neighbor Descent (NND) by adding the "Neighborhood" constraint on ND. Consequently, clusters automatically emerged, without the additional requirement of removing the redundant edges. However, NND proved still not perfect, since it brought in a new yet worse problem, the "over-partitioning" problem. Now, in this paper, we propose a method, called the Hierarchical Nearest Neighbor Descent (H-NND), which overcomes the over-partitioning problem of NND via using the hierarchical strategy. Specifically, H-NND uses ND to effectively merge the over-segmented sub-graphs or clusters that NND produces. Like ND, H-NND also generates the IT structure, in which the redundant edges once again appear. This seemingly comes back to the situation that ND faces. However, compared with ND, the redundant edges in the IT structure generated by H-NND generally become more salient, thus being much easier and more reliable to be identified even by the simplest edge-removing method which takes the edge length as the only measure. In other words, the IT structure constructed by H-NND becomes more fitted for data clustering. We prove this on several clustering datasets of varying shapes, dimensions and attributes. Besides, compared with ND, H-NND generally takes less computation time to construct the IT data structure for the input data.

artificial intelligence, machine learning, node, (15 more...)

arXiv.org Machine Learning

1509.02805

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

Multivariate response and parsimony for Gaussian cluster-weighted models

Dang, Utkarsh J., Punzo, Antonio, McNicholas, Paul D., Ingrassia, Salvatore, Browne, Ryan P.

arXiv.org Machine LearningFeb-26-2016

A family of parsimonious Gaussian cluster-weighted models is presented. This family concerns a multivariate extension to cluster-weighted modelling that can account for correlations between multivariate responses. Parsimony is attained by constraining parts of an eigen-decomposition imposed on the component covariance matrices. A sufficient condition for identifiability is provided and an expectation-maximization algorithm is presented for parameter estimation. Model performance is investigated on both synthetic and classical real data sets and compared with some popular approaches. Finally, accounting for linear dependencies in the presence of a linear regression structure is shown to offer better performance, vis-\`{a}-vis clustering, over existing methodologies.

artificial intelligence, bayesian inference, machine learning, (17 more...)

arXiv.org Machine Learning

1411.056

Country:

North America > United States (1.00)
Europe (0.67)
North America > Canada > Ontario (0.46)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback