Clustering
A Survey on Multi-View Clustering
Chao, Guoqing, Sun, Shiliang, Bi, Jinbo
Clustering [1] is a paradigm to classify the subjects into several groups based on their similarity information. As we know that clustering is a fundamental task in machine learning, pattern recognition and data mining fields and it has widespread applications. With the obtained groups by clustering methods, further analysis tasks can be conducted to achieve different ultimate goals. However, traditional clustering methods only use one feature set or one view information of the subjects while multiple feature sets or multiple view information of these subjects are available. The subjects of interest with multiple feature sets or multiple view information are the so called multi-view data. Multi-view data are very common in real-world applications due to the innate properties, or collecting from different sources. For instance, a web page can be described by the words appearing on the web page itself and the words underlying all links pointing to the web page from other pages in nature. In multimedia content understanding, the multimedia segments can be simultaneously described by their video signals from visual camera and audio signals from voice recorder devices. The existence of such multi-view data raised the interest of multi-view learning [2], [3], [4], which has been extensively studied in semi-supervised setting.
Taming Wild High Dimensional Text Data with a Fuzzy Lash
The bag of words (BOW) represents a corpus in a matrix whose elements are the frequency of words. However, each row in the matrix is a very high-dimensional sparse vector. Dimension reduction (DR) is a popular method to address sparsity and high-dimensionality issues. Among different strategies to develop DR method, Unsupervised Feature Transformation (UFT) is a popular strategy to map all words on a new basis to represent BOW. The recent increase of text data and its challenges imply that DR area still needs new perspectives. Although a wide range of methods based on the UFT strategy has been developed, the fuzzy approach has not been considered for DR based on this strategy. This research investigates the application of fuzzy clustering as a DR method based on the UFT strategy to collapse BOW matrix to provide a lower-dimensional representation of documents instead of the words in a corpus. The quantitative evaluation shows that fuzzy clustering produces superior performance and features to Principal Components Analysis (PCA) and Singular Value Decomposition (SVD), two popular DR methods based on the UFT strategy.
A Novel Bayesian Cluster Enumeration Criterion for Unsupervised Learning
Teklehaymanot, Freweyni K., Muma, Michael, Zoubir, Abdelhak M.
We derive a new Bayesian Information Criterion (BIC) from first principles by formulating the problem of estimating the number of clusters in an observed data set as maximization of the posterior probability of the candidate models. Given that some mild assumptions are satisfied, we provide a general BIC expression for a broad class of data distributions. This serves as an important milestone when deriving the BIC for specific data distributions. Along this line, we provide a closed-form BIC expression for multivariate Gaussian distributed observations. We show that incorporating data structure of the clustering problem into the derivation of the BIC results in an expression whose penalty term is different from that of the original BIC. We propose a two-step cluster enumeration algorithm. First, a model-based unsupervised learning algorithm partitions the data according to a given set of candidate models. Subsequently, the optimal cluster number is determined as the one associated to the model for which the proposed BIC is maximal. The performance of the proposed criterion is tested using synthetic and real data sets. Despite the fact that the original BIC is a generic criterion which does not include information about the specific model selection problem at hand, it has been widely used in the literature to estimate the number of clusters in an observed data set. We, therefore, consider it as a benchmark comparison. Simulation results show that our proposed criterion outperforms the existing cluster enumeration methods that are based on the original BIC.
Wade , Ghahramani : Bayesian Cluster Analysis: Point Estimation and Credible Balls
Clustering is widely studied in statistics and machine learning, with applications in a variety of fields. As opposed to popular algorithms such as agglomerative hierarchical clustering or k-means which return a single clustering solution, Bayesian nonparametric models provide a posterior over the entire space of partitions, allowing one to assess statistical properties, such as uncertainty on the number of clusters. However, an important problem is how to summarize the posterior; the huge dimension of partition space and difficulties in visualizing it add to this problem. In a Bayesian analysis, the posterior of a real-valued parameter of interest is often summarized by reporting a point estimate such as the posterior mean along with 95% credible intervals to characterize uncertainty. In this paper, we extend these ideas to develop appropriate point estimates and credible sets to summarize the posterior of the clustering structure based on decision and information theoretic techniques.
ClustGeo: an R package for hierarchical clustering with spatial constraints
Chavent, Marie, Kuentz-Simonet, Vanessa, Labenne, Amaury, Saracco, Jรฉrรดme
In this paper, we propose a Ward-like hierarchical clustering algorithm including spatial/geographical constraints. Two dissimilarity matrices $D_0$ and $D_1$ are inputted, along with a mixing parameter $\alpha \in [0,1]$. The dissimilarities can be non-Euclidean and the weights of the observations can be non-uniform. The first matrix gives the dissimilarities in the "feature space" and the second matrix gives the dissimilarities in the "constraint space". The criterion minimized at each stage is a convex combination of the homogeneity criterion calculated with $D_0$ and the homogeneity criterion calculated with $D_1$. The idea is then to determine a value of $\alpha$ which increases the spatial contiguity without deteriorating too much the quality of the solution based on the variables of interest i.e. those of the feature space. This procedure is illustrated on a real dataset using the R package ClustGeo.
A Quantum Extension of Variational Bayes Inference
Miyahara, Hideyuki, Sughiyama, Yuki
Institute of Industrial Science, The University of Tokyo, 4-6-1, Komaba, Meguro-ku, Tokyo 153-8505, Japan (Dated: December 14, 2017) Variational Bayes (VB) inference is one of the most important algorithms in machine learning and widely used in engineering and industry. However, VB is known to suffer from the problem of local optima. In this Letter, we generalize VB by using quantum mechanics, and propose a new algorithm, which we call quantum annealing variational Bayes (QA VB) inference. We then show that QA VB drastically improve the performance of VB by applying them to a clustering problem described by a Gaussian mixture model. Finally, we discuss an intuitive understanding on how QA VB works well.
Incremental Eigenpair Computation for Graph Laplacian Matrices: Theory and Applications
Chen, Pin-Yu, Zhang, Baichuan, Hasan, Mohammad Al
The smallest eigenvalues and the associated eigenvectors (i.e., eigenpairs) of a graph Laplacian matrix have been widely used in spectral clustering and community detection. However, in real-life applications the number of clusters or communities (say, $K$) is generally unknown a-priori. Consequently, the majority of the existing methods either choose $K$ heuristically or they repeat the clustering method with different choices of $K$ and accept the best clustering result. The first option, more often, yields suboptimal result, while the second option is computationally expensive. In this work, we propose an incremental method for constructing the eigenspectrum of the graph Laplacian matrix. This method leverages the eigenstructure of graph Laplacian matrix to obtain the $K$-th smallest eigenpair of the Laplacian matrix given a collection of all previously computed $K-1$ smallest eigenpairs. Our proposed method adapts the Laplacian matrix such that the batch eigenvalue decomposition problem transforms into an efficient sequential leading eigenpair computation problem. As a practical application, we consider user-guided spectral clustering. Specifically, we demonstrate that users can utilize the proposed incremental method for effective eigenpair computation and for determining the desired number of clusters based on multiple clustering metrics.
Oversampling for Imbalanced Learning Based on K-Means and SMOTE
Last, Felix, Douzas, Georgios, Bacao, Fernando
Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the python programming language.
Outlier Detection by Consistent Data Selection Method
Porwal, Utkarsh, Mukund, Smruthi
Often the challenge associated with tasks like fraud and spam detection[1] is the lack of all likely patterns needed to train suitable supervised learning models. In order to overcome this limitation, such tasks are attempted as outlier or anomaly detection tasks. We also hypothesize that out- liers have behavioral patterns that change over time. Limited data and continuously changing patterns makes learning significantly difficult. In this work we are proposing an approach that detects outliers in large data sets by relying on data points that are consistent. The primary contribution of this work is that it will quickly help retrieve samples for both consistent and non-outlier data sets and is also mindful of new outlier patterns. No prior knowledge of each set is required to extract the samples. The method consists of two phases, in the first phase, consistent data points (non- outliers) are retrieved by an ensemble method of unsupervised clustering techniques and in the second phase a one class classifier trained on the consistent data point set is ap- plied on the remaining sample set to identify the outliers. The approach is tested on three publicly available data sets and the performance scores are competitive.
Improving Malware Detection Accuracy by Extracting Icon Information
Silva, Pedro, Akhavan-Masouleh, Sepehr, Li, Li
Detecting PE malware files is now commonly approached using statistical and machine learning models. While these models commonly use features extracted from the structure of PE files, we propose that icons from these files can also help better predict malware. We propose an innovative machine learning approach to extract information from icons. Our proposed approach consists of two steps: 1) extracting icon features using summary statics, histogram of gradients (HOG), and a convolutional autoencoder, 2) clustering icons based on the extracted icon features. Using publicly available data and by using machine learning experiments, we show our proposed icon clusters significantly boost the efficacy of malware prediction models. In particular, our experiments show an average accuracy increase of 10% when icon clusters are used in the prediction model.