WoCE: A Framework for Clustering Ensemble by Exploiting the Wisdom of Crowds Theory

arXiv.org Machine Learning

The Wisdom of Crowds (WOC), a theory from the social sciences, has found a new paradigm in computer science. The WOC theory states that the aggregate decision made by a group is often better than those of its individual members, provided specific conditions are satisfied. This paper presents a novel framework for unsupervised and semi-supervised cluster ensembles that exploits the WOC theory. We employ four conditions of the WOC theory, i.e., diversity, independence, decentralization, and aggregation, to guide both the construction of individual clustering results and their final combination in the clustering ensemble. First, the independence criterion, realized as a novel mapping of the raw data set, removes the correlation between features in our proposed method. Then, decentralization serves as a novel mechanism for generating high-quality individual clustering results. Next, uniformity, a new diversity metric, evaluates the generated clustering results. Finally, a weighted evidence accumulation clustering method is proposed for the final aggregation without resorting to a thresholding procedure. An experimental study on varied data sets demonstrates that the proposed approach achieves superior performance to state-of-the-art methods.


Consensus Guided Unsupervised Feature Selection

AAAI Conferences

Feature selection has been widely recognized as one of the key problems in the data mining and machine learning community, especially for high-dimensional data with redundant information, partial noise, and outliers. Recently, unsupervised feature selection has attracted substantial research attention, since data acquisition is rather cheap today but labeling remains expensive and time-consuming. This is especially useful for effective feature selection in clustering tasks. Recent works using sparse projection with pre-learned pseudo labels achieve appealing results; however, they generate pseudo labels from all features, so that noisy and ineffective features degrade the cluster structure and further harm the performance of feature selection. Besides, these methods suffer from a complex composition of multiple constraints and from computational inefficiency, e.g., eigen-decomposition. Differently, in this work we introduce consensus clustering for pseudo labeling, which avoids expensive eigen-decomposition and provides better clustering accuracy with high robustness. In addition, complex constraints such as non-negativity are removed thanks to the crisp indicators of consensus clustering. Specifically, we propose an efficient formulation for unsupervised feature selection based on the utility function and provide theoretical analysis of the optimization rules and model convergence. Extensive experiments on several popular data sets demonstrate that our method is superior to the most recent state-of-the-art works in terms of NMI.
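The core idea of ranking features against crisp pseudo labels can be sketched as below. Everything here is an illustrative assumption: the consensus step is replaced by fixed pseudo labels, and the score is a simple between/within variance ratio, a stand-in for the paper's utility-function formulation.

```python
# Hedged sketch: scoring features against crisp consensus pseudo labels.

def mean(xs):
    return sum(xs) / len(xs)

def feature_score(column, labels):
    """Between-cluster variance divided by within-cluster variance."""
    overall = mean(column)
    between = within = 0.0
    for c in set(labels):
        vals = [x for x, l in zip(column, labels) if l == c]
        m = mean(vals)
        between += len(vals) * (m - overall) ** 2
        within += sum((x - m) ** 2 for x in vals)
    return between / (within + 1e-12)

# Toy data: feature 0 separates the pseudo clusters, feature 1 is noise.
X = [[0.10, 5.0], [0.20, 1.0], [0.15, 3.0],
     [2.10, 4.0], [2.20, 2.0], [2.05, 0.0]]
pseudo = [0, 0, 0, 1, 1, 1]  # crisp indicators from a consensus step
scores = [feature_score([row[j] for row in X], pseudo) for j in range(2)]
print(scores.index(max(scores)))  # → 0: the discriminative feature wins
```

Because the pseudo labels are crisp cluster indicators rather than a relaxed spectral embedding, no non-negativity constraint or eigen-decomposition is needed in this toy setting.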


Variable Selection Methods for Model-based Clustering

arXiv.org Machine Learning

Model-based clustering is a popular approach for clustering multivariate data that has seen applications in numerous fields. Nowadays, high-dimensional data are increasingly common, and the model-based clustering approach has adapted to deal with the growing dimensionality. In particular, the development of variable selection techniques has received a lot of attention and research effort in recent years. Even for small problems, variable selection has been advocated to facilitate the interpretation of clustering results. This review provides a summary of the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated and illustrated on two data analysis examples.
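A common family of methods reviewed here compares, per variable, a model with cluster structure against one without, via BIC. The toy sketch below illustrates that idea in Python (the review's examples use R packages; the one-dimensional two-component EM, the median-split initialization, and the parameter counts are all illustrative assumptions, not any specific package's algorithm).

```python
# Hedged toy sketch: BIC-based screening for clustering-relevant variables.
# Keep a variable if a 2-component Gaussian mixture has a lower BIC than
# a single Gaussian.
import math
import random

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def loglik_single(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs) + 1e-9
    return sum(math.log(gauss_pdf(x, mu, var)) for x in xs)

def loglik_mixture(xs, iters=50):
    # Small 2-component EM, initialized by splitting at the median.
    s = sorted(xs)
    half = len(s) // 2
    mu = [sum(s[:half]) / half, sum(s[half:]) / (len(s) - half)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        resp = []
        for x in xs:  # E-step: responsibilities
            p = [pi[k] * gauss_pdf(x, mu[k], var[k]) for k in range(2)]
            tot = p[0] + p[1]
            resp.append([p[0] / tot, p[1] / tot])
        for k in range(2):  # M-step: update weights, means, variances
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk + 1e-9
    return sum(math.log(sum(pi[k] * gauss_pdf(x, mu[k], var[k])
                            for k in range(2))) for x in xs)

def bic(loglik, n_params, n):
    return -2 * loglik + n_params * math.log(n)

random.seed(0)
# Variable A is clearly bimodal; variable B is a single Gaussian.
a = ([random.gauss(0, 0.3) for _ in range(60)] +
     [random.gauss(4, 0.3) for _ in range(60)])
b = [random.gauss(0, 1.0) for _ in range(120)]
supports = {}
for name, xs in [("A", a), ("B", b)]:
    # Mixture: 2 means + 2 variances + 1 weight; single: mean + variance.
    supports[name] = (bic(loglik_mixture(xs), 5, len(xs))
                      < bic(loglik_single(xs), 2, len(xs)))
    print(name, "supports clustering:", supports[name])
```

The BIC penalty keeps the mixture from winning on purely unimodal variables, which is the screening intuition behind greedy variable selection for model-based clustering.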


Randomized Structural Sparsity via Constrained Block Subsampling for Improved Sensitivity of Discriminative Voxel Identification

arXiv.org Machine Learning

In this paper, we consider voxel selection for functional Magnetic Resonance Imaging (fMRI) brain data with the aim of finding a more complete set of probably correlated discriminative voxels, thus improving interpretation of the discovered potential biomarkers. The main difficulty is the extremely high-dimensional voxel space combined with few training samples, which makes feature selection unreliable. To deal with this difficulty, stability selection has received a great deal of attention lately, especially due to its finite-sample control of false discoveries and its transparent principle for choosing a proper amount of regularization. However, it fails to make explicit use of the correlation property or structural information of these discriminative features and leads to large false negative rates; in other words, many relevant but probably correlated discriminative voxels are missed. Thus, we propose a new variant of stability selection, "randomized structural sparsity", which incorporates the idea of structural sparsity. Numerical experiments demonstrate that our method can be superior in controlling false negatives while retaining the control of false positives inherited from stability selection.
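The stability-selection-with-structure idea can be sketched as follows. This is a hedged illustration under strong simplifications: the base selector is a plain top-k correlation screen rather than the sparse estimators used in the paper, and block (structure) membership is assumed known.

```python
# Hedged sketch: repeated subsampling with block-level selection
# frequencies, so correlated features in a relevant block stand or
# fall together instead of splitting votes.
import random

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy + 1e-12)

def block_stability(X, y, blocks, top_k=4, runs=50, subsample=0.5):
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(runs):
        idx = random.sample(range(n), int(subsample * n))
        scores = [abs(corr([X[i][j] for i in idx], [y[i] for i in idx]))
                  for j in range(p)]
        for j in sorted(range(p), key=lambda j: -scores[j])[:top_k]:
            counts[j] += 1
    # Structural step: average selection frequency within each block.
    return [sum(counts[j] for j in blk) / (len(blk) * runs)
            for blk in blocks]

random.seed(1)
n, noise_p = 40, 6
signal = [random.gauss(0, 1) for _ in range(n)]
y = [s + random.gauss(0, 0.3) for s in signal]
# Block 0: three correlated copies of the signal; block 1: pure noise.
X = [[signal[i] + random.gauss(0, 0.2) for _ in range(3)] +
     [random.gauss(0, 1) for _ in range(noise_p)] for i in range(n)]
blocks = [list(range(3)), list(range(3, 3 + noise_p))]
freqs = block_stability(X, y, blocks)
print(freqs)  # block 0 should score far higher than the noise block
```

Averaging frequencies over a block is what rescues correlated discriminative features that, scored individually, would each fall below the stability threshold.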


Consensus Clustering: An Embedding Perspective, Extension and Beyond

arXiv.org Machine Learning

Consensus clustering fuses diverse basic partitions (i.e., clustering results obtained from conventional clustering methods) into an integrated one; it has attracted increasing attention in both academia and industry due to its robust and effective performance. Tremendous research efforts have been made to advance this domain in terms of algorithms and applications. Although some survey papers summarize the existing literature, they neglect the underlying connections among the different categories. Differently, in this paper we aim to provide an embedding perspective to illustrate the consensus mechanism, which transfers categorical basic partitions into other representations (e.g., binary coding, spectral embedding, etc.) for the clustering purpose. To this end, we not only unify two major categories of consensus clustering, but also build an intuitive connection between consensus clustering and graph embedding. Moreover, we elaborate several extensions of classical consensus clustering under different settings and problems. Beyond this, we demonstrate how to leverage consensus clustering to address other tasks, such as constrained clustering, domain adaptation, feature selection, and outlier detection. Finally, we conclude this survey with future work in terms of interpretability, learnability, and theoretical analysis.
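The "binary coding" representation mentioned above admits a very small sketch: each basic partition is one-hot encoded and the codes are concatenated, turning categorical partitions into a numeric matrix that any conventional clustering method can consume. The toy partitions below are illustrative.

```python
# Hedged sketch: binary-coding embedding of categorical basic partitions.

def binary_embedding(partitions):
    """Concatenate one-hot encodings of each basic partition."""
    rows = len(partitions[0])
    embedded = [[] for _ in range(rows)]
    for labels in partitions:
        values = sorted(set(labels))
        for i, lab in enumerate(labels):
            embedded[i].extend(1 if lab == v else 0 for v in values)
    return embedded

# Two basic partitions of four samples (2 and 3 clusters respectively).
partitions = [[0, 0, 1, 1],
              [0, 1, 2, 2]]
B = binary_embedding(partitions)
for row in B:
    print(row)
# Each row carries exactly one 1 per partition, so row sums equal the
# number of basic partitions; samples 2 and 3 get identical codes
# because every partition agrees on them.
```

Running a standard clusterer (e.g., k-means) on this binary matrix is one concrete instance of the embedding view of consensus clustering that the survey develops.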