Clustering
How many types of Cluster Analysis and Techniques using R
K Means Clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity. Unsupervised learning means that there is no outcome to be predicted, and the algorithm just tries to find patterns in the data. In'k' means clustering, we have the specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster, and finds the centroid of each cluster. These two steps are repeated till the within cluster variation cannot be reduced any further.
Spectral Clustering using PCKID - A Probabilistic Cluster Kernel for Incomplete Data
Løkse, Sigurd, Bianchi, Filippo Maria, Salberg, Arnt-Børre, Jenssen, Robert
In this paper, we propose PCKID, a novel, robust, kernel function for spectral clustering, specifically designed to handle incomplete data. By combining posterior distributions of Gaussian Mixture Models for incomplete data on different scales, we are able to learn a kernel for incomplete data that does not depend on any critical hyperparameters, unlike the commonly used RBF kernel. To evaluate our method, we perform experiments on two real datasets. PCKID outperforms the baseline methods for all fractions of missing values and in some cases outperforms the baseline methods with up to 25 percentage points.
K-Means & Other Clustering Algorithms: A Quick Intro with Python
Clustering is the grouping of objects together so that objects belonging in the same group (cluster) are more similar to each other than those in other groups (clusters). In this intro cluster analysis tutorial, we'll check out a few algorithms in Python so you can get a basic understanding of the fundamentals of clustering on a real dataset. For the clustering problem, we will use the famous Zachary's Karate Club dataset. The story behind the data set is quite simple: There was a Karate Club that had an administrator "John A" and an instructor "Mr. Then a conflict arose between them, causing the students (Nodes) to split into two groups.
Towards a Unified Taxonomy of Biclustering Methods
Ignatov, Dmitry I., Watson, Bruce W.
Being an unsupervised machine learning and data mining technique, biclustering and its multimodal extensions are becoming popular tools for analysing object-attribute data in different domains. Apart from conventional clustering techniques, biclustering is searching for homogeneous groups of objects while keeping their common description, e.g., in binary setting, their shared attributes. In bioinformatics, biclustering is used to find genes, which are active in a subset of situations, thus being candidates for biomarkers. However, the authors of those biclustering techniques that are popular in gene expression analysis, may overlook the existing methods. For instance, BiMax algorithm is aimed at finding biclusters, which are well-known for decades as formal concepts. Moreover, even if bioinformatics classify the biclustering methods according to reasonable domain-driven criteria, their classification taxonomies may be different from survey to survey and not full as well. So, in this paper we propose to use concept lattices as a tool for taxonomy building (in the biclustering domain) and attribute exploration as means for cross-domain taxonomy completion.
Semi-supervised Learning for Discrete Choice Models
Yang, Jie, Shebalov, Sergey, Klabjan, Diego
We introduce a semi-supervised discrete choice model to calibrate discrete choice models when relatively few requests have both choice sets and stated preferences but the majority only have the choice sets. Two classic semi-supervised learning algorithms, the expectation maximization algorithm and the cluster-and-label algorithm, have been adapted to our choice modeling problem setting. We also develop two new algorithms based on the cluster-and-label algorithm. The new algorithms use the Bayesian Information Criterion to evaluate a clustering setting to automatically adjust the number of clusters. Two computational studies including a hotel booking case and a large-scale airline itinerary shopping case are presented to evaluate the prediction accuracy and computational effort of the proposed algorithms. Algorithmic recommendations are rendered under various scenarios.
Simultaneous Clustering and Ensemble
Tao, Zhiqiang (Northeastern University) | Liu, Hongfu (Northeastern University) | Fu, Yun (Northeastern University)
Ensemble Clustering (EC) has gained a great deal of attention throughout the fields of data mining and machine learning, since it emerged as an effective and robust clustering framework. Typically, EC methods try to fuse multiple basic partitions (BPs) into a consensus one, of which each BP is obtained by performing traditional clustering method on the same dataset. One promising direction for ensemble clustering is to derive pairwise similarity from BPs, and then transform it as a graph partition problem. However, these graph based methods may suffer from an information loss when computing the similarity between data points, because they only utilize the categorical data provided by multiple BPs, yet neglect rich information from raw features. This problem can badly undermine the underlying cluster structure in the original feature space, and thus degrade the clustering performance. In light of this, we propose a novel Simultaneous Clustering and Ensemble (SCE) framework to alleviate such detrimental effect, which employs the similarity matrix from raw features to enhance the co-association matrix summarized by multiple BPs. Two neat closed-form solutions given by eigenvalue decomposition are provided for SCE. Experiments conducted on 16 real-world datasets demonstrate the effectiveness of the proposed SCE over the traditional clustering and state-of-the-art ensemble clustering methods. Moreover, several impact factors that may affect our method are also explored extensively.
A General Clustering Agreement Index: For Comparing Disjoint and Overlapping Clusters
Rabbany, Reihaneh (University of Alberta) | Zaïane, Osmar R. (University of Alberta)
A clustering agreement index quantifies the similarity between two given clusterings. It is most commonly used to compare the results obtained from different clustering algorithms against the ground-truth clustering in the benchmark datasets. In this paper, we present a general Clustering Agreement Index (CAI) for comparing disjoint and overlapping clusterings. CAI is generic and introduces a family of clustering agreement indexes. In particular, the two widely used indexes of Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI), are special cases of the CAI. Our index, therefore, provides overlapping extensions for both these commonly used indexes, whereas their original formulations are only defined for disjoint cases. Lastly, unlike previous indexes, CAI is flexible and can be adapted to incorporate the structure of the data, which is important when comparing clusters in networks, a.k.a communities.
Latent Tree Analysis
Zhang, Nevin L. (The Hong Kong University of Science and Technology) | Poon, Leonard K. M. (The Education University of Hong Kong)
Latent tree analysis seeks to model the correlations amonga set of random variables using a tree of latent variables. It was proposed as an improvement to latent class analysis—a method widely used in social sciences and medicine to identify homogeneous subgroups in a population. It provides new and fruitful perspectives on a number of machine learningareas, including cluster analysis, topic detection, and deep probabilistic modeling. This paper gives an overview of the research on latent tree analysis and various ways it is used inpractice.
Compressed K-Means for Large-Scale Clustering
Shen, Xiaobo (Nanjing University of Science and Technology) | Liu, Weiwei (University of Technology Sydney) | Tsang, Ivor (University of Technology Sydney) | Shen, Fumin (University of Electronic Science and Technology of China) | Sun, Quan-Sen (Nanjing University of Science and Technology)
Large-scale clustering has been widely used in many applications, and has received much attention. Most existing clustering methods suffer from both expensive computation and memory costs when applied to large-scale datasets. In this paper, we propose a novel clustering method, dubbed compressed k-means (CKM), for fast large-scale clustering. Specifically, high-dimensional data are compressed into short binary codes, which are well suited for fast clustering. CKM enjoys two key benefits: 1) storage can be significantly reduced by representing data points as binary codes; 2) distance computation is very efficient using Hamming metric between binary codes. We propose to jointly learn binary codes and clusters within one framework. Extensive experimental results on four large-scale datasets, including two million-scale datasets demonstrate that CKM outperforms the state-of-the-art large-scale clustering methods in terms of both computation and memory cost, while achieving comparable clustering accuracy.
Robust Manifold Matrix Factorization for Joint Clustering and Feature Extraction
Zhang, Lefei (Wuhan University) | Zhang, Qian (Alibaba Group) | Du, Bo (Wuhan University) | Tao, Dacheng (University of Technology Sydney) | You, Jane (The Hong Kong Polytechnic University)
Low-rank matrix approximation has been widely used for data subspace clustering and feature representation in many computer vision and pattern recognition applications. However, in order to enhance the discriminability, most of the matrix approximation based feature extraction algorithms usually generate the cluster labels by certain clustering algorithm (e.g., the kmeans) and then perform the matrix approximation guided by such label information. In addition, the noises and outliers in the dataset with large reconstruction errors will easily dominate the objective function by the conventional ℓ 2 -norm based squared residue minimization. In this paper, we propose a novel clustering and feature extraction algorithm based on an unified low-rank matrix factorization framework, which suggests that the observed data matrix can be approximated by the production of projection matrix and low dimensional representation, among which the low-dimensional representation can be approximated by the cluster indicator and latent feature matrix simultaneously. Furthermore, we have proposed using the ℓ 2,1 -norm and integrating the manifold regularization to further promote the proposed model. A novel Augmented Lagrangian Method (ALM) based procedure is designed to effectively and efficiently seek the optimal solution of the problem. The experimental results in both clustering and feature extraction perspectives demonstrate the superior performance of the proposed method.