Clustering
Machine Learning: An In-Depth, Non-Technical Guide -- Part 5 -- InnoArchiTech Innovation -- Data -- Technology -- Leadership
Originally published at innoarchitech.com here on March 18, 2016. Welcome to the fifth and final chapter in a five-part series about machine learning. In this final chapter, we will revisit unsupervised learning in greater depth, briefly discuss other fields related to machine learning, and finish the series with some examples of real-world machine learning applications. Recall that unsupervised learning involves learning from data, but without the goal of prediction. This is because the data is either not given with a target response variable (label), or one chooses not to designate a response.
Machine Learning: An In-Depth, Non-Technical Guide - Part 5
Welcome to the fifth and final chapter in a five-part series about machine learning. In this final chapter, we will revisit unsupervised learning in greater depth, briefly discuss other fields related to machine learning, and finish the series with some examples of real-world machine learning applications. Recall that unsupervised learning involves learning from data, but without the goal of prediction. This is because the data is either not given with a target response variable (label), or one chooses not to designate a response. It can also be used as a pre-processing step for supervised learning.
K Means Clustering - Effect of random seed
When the k-means clustering algorithm runs, it uses a randomly generated seed to determine the starting centroids of the clusters. However, if the data is evenly distributed, then we might end up with different cluster members based on the initial random variable. An example for such a behavior is shown. R is used for the experiment. The code to load the data and the contents of the data are as follows. We try to group the samples based on two feature variables - age and bmi.
Data Clustering and Graph Partitioning via Simulated Mixing
Bhatti, Shahzad, Beck, Carolyn, Nedic, Angelia
Spectral clustering approaches have led to well-accepted algorithms for finding accurate clusters in a given dataset. However, their application to large-scale datasets has been hindered by computational complexity of eigenvalue decompositions. Several algorithms have been proposed in the recent past to accelerate spectral clustering, however they compromise on the accuracy of the spectral clustering to achieve faster speed. In this paper, we propose a novel spectral clustering algorithm based on a mixing process on a graph. Unlike the existing spectral clustering algorithms, our algorithm does not require computing eigenvectors. Specifically, it finds the equivalent of a linear combination of eigenvectors of the normalized similarity matrix weighted with corresponding eigenvalues. This linear combination is then used to partition the dataset into meaningful clusters. Simulations on real datasets show that partitioning datasets based on such linear combinations of eigenvectors achieves better accuracy than standard spectral clustering methods as the number of clusters increase. Our algorithm can easily be implemented in a distributed setting.
Efficient Clustering of Correlated Variables and Variable Selection in High-Dimensional Linear Models
Gauraha, Niharika, Parui, Swapan K.
In this paper, we introduce Adaptive Cluster Lasso(ACL) method for variable selection in high dimensional sparse regression models with strongly correlated variables. To handle correlated variables, the concept of clustering or grouping variables and then pursuing model fitting is widely accepted. When the dimension is very high, finding an appropriate group structure is as difficult as the original problem. The ACL is a three-stage procedure where, at the first stage, we use the Lasso(or its adaptive or thresholded version) to do initial selection, then we also include those variables which are not selected by the Lasso but are strongly correlated with the variables selected by the Lasso. At the second stage we cluster the variables based on the reduced set of predictors and in the third stage we perform sparse estimation such as Lasso on cluster representatives or the group Lasso based on the structures generated by clustering procedure. We show that our procedure is consistent and efficient in finding true underlying population group structure(under assumption of irrepresentable and beta-min conditions). We also study the group selection consistency of our method and we support the theory using simulated and pseudo-real dataset examples.
Bipartite Correlation Clustering -- Maximizing Agreements
Asteris, Megasthenis, Kyrillidis, Anastasios, Papailiopoulos, Dimitris, Dimakis, Alexandros G.
In Bipartite Correlation Clustering (BCC) we are given a complete bipartite graph $G$ with `+' and `-' edges, and we seek a vertex clustering that maximizes the number of agreements: the number of all `+' edges within clusters plus all `-' edges cut across clusters. BCC is known to be NP-hard. We present a novel approximation algorithm for $k$-BCC, a variant of BCC with an upper bound $k$ on the number of clusters. Our algorithm outputs a $k$-clustering that provably achieves a number of agreements within a multiplicative ${(1-\delta)}$-factor from the optimal, for any desired accuracy $\delta$. It relies on solving a combinatorially constrained bilinear maximization on the bi-adjacency matrix of $G$. It runs in time exponential in $k$ and $\delta^{-1}$, but linear in the size of the input. Further, we show that, in the (unconstrained) BCC setting, an ${(1-\delta)}$-approximation can be achieved by $O(\delta^{-1})$ clusters regardless of the size of the graph. In turn, our $k$-BCC algorithm implies an Efficient PTAS for the BCC objective of maximizing agreements.
A Bayesian non-parametric method for clustering high-dimensional binary data
In many real life problems, objects are described by large number of binary features. For instance, documents are characterized by presence or absence of certain keywords; cancer patients are characterized by presence or absence of certain mutations etc. In such cases, grouping together similar objects/profiles based on such high dimensional binary features is desirable, but challenging. Here, I present a Bayesian non parametric algorithm for clustering high dimensional binary data. It uses a Dirichlet Process (DP) mixture model and simulated annealing to not only cluster binary data, but also find optimal number of clusters in the data. The performance of the algorithm was evaluated and compared with other algorithms using simulated datasets. It outperformed all other clustering methods that were tested in the simulation studies. It was also used to cluster real datasets arising from document analysis, handwritten image analysis and cancer research. It successfully divided a set of documents based on their topics, hand written images based on different styles of writing digits and identified tissue and mutation specificity of chemotherapy treatments.
Efficient Multiscale Gaussian Process Regression using Hierarchical Clustering
Zhang, Z., Duraisamy, K., Gumerov, N. A.
Machine Learning manuscript No. (will be inserted by the editor) Abstract Standard Gaussian Process (GP) regression, a powerful machine learning tool, is computationally expensive when it is applied to large datasets, and potentially inaccurate when data points are sparsely distributed in a highdimensional feature space. To address these challenges, a new multiscale, sparsified GP algorithm is formulated, with the goal of application to large scientific computing datasets. In this approach, the data is partitioned into clusters and the cluster centers are used to define a reduced training set, resulting in an improvement over standard GPs in terms of training and evaluation costs. Further, a hierarchical technique is used to adaptively map the local covariance representation to the underlying sparsity of the feature space, leading to improved prediction accuracy when the data distribution is highly nonuniform. A theoretical investigation of the computational complexity of the algorithm is presented. The efficacy of this method is then demonstrated on smooth and discontinuous analytical functions and on data from a direct numerical simulation of turbulent combustion. Keywords Gaussian Processes, Sparse regression, Clustering. 1 Introduction The rapid growth in computing power has resulted in the generation of massive amounts of highly-resolved datasets in many fields of science and engineering.
Clustering by Hierarchical Nearest Neighbor Descent (H-NND)
Previously in 2014, we proposed the Nearest Descent (ND) method, capable of generating an efficient Graph, called the in-tree (IT). Due to some beautiful and effective features, this IT structure proves well suited for data clustering. Although there exist some redundant edges in IT, they usually have salient features and thus it is not hard to remove them. Subsequently, in order to prevent the seemingly redundant edges from occurring, we proposed the Nearest Neighbor Descent (NND) by adding the "Neighborhood" constraint on ND. Consequently, clusters automatically emerged, without the additional requirement of removing the redundant edges. However, NND proved still not perfect, since it brought in a new yet worse problem, the "over-partitioning" problem. Now, in this paper, we propose a method, called the Hierarchical Nearest Neighbor Descent (H-NND), which overcomes the over-partitioning problem of NND via using the hierarchical strategy. Specifically, H-NND uses ND to effectively merge the over-segmented sub-graphs or clusters that NND produces. Like ND, H-NND also generates the IT structure, in which the redundant edges once again appear. This seemingly comes back to the situation that ND faces. However, compared with ND, the redundant edges in the IT structure generated by H-NND generally become more salient, thus being much easier and more reliable to be identified even by the simplest edge-removing method which takes the edge length as the only measure. In other words, the IT structure constructed by H-NND becomes more fitted for data clustering. We prove this on several clustering datasets of varying shapes, dimensions and attributes. Besides, compared with ND, H-NND generally takes less computation time to construct the IT data structure for the input data.
Multivariate response and parsimony for Gaussian cluster-weighted models
Dang, Utkarsh J., Punzo, Antonio, McNicholas, Paul D., Ingrassia, Salvatore, Browne, Ryan P.
A family of parsimonious Gaussian cluster-weighted models is presented. This family concerns a multivariate extension to cluster-weighted modelling that can account for correlations between multivariate responses. Parsimony is attained by constraining parts of an eigen-decomposition imposed on the component covariance matrices. A sufficient condition for identifiability is provided and an expectation-maximization algorithm is presented for parameter estimation. Model performance is investigated on both synthetic and classical real data sets and compared with some popular approaches. Finally, accounting for linear dependencies in the presence of a linear regression structure is shown to offer better performance, vis-\`{a}-vis clustering, over existing methodologies.