Clustering
Similarity-Driven Cluster Merging Method for Unsupervised Fuzzy Clustering
Xiong, Xuejian, Chan, Kap, Tan, Kian Lee
In this paper, a similarity-driven cluster merging method is proposed for unsuper-vised fuzzy clustering. The cluster merging method is used to resolve the problem of cluster validation. Starting with an overspecified number of clusters in the data, pairs of similar clusters are merged based on the proposed similarity-driven cluster merging criterion. The similarity between clusters is calculated by a fuzzy cluster similarity matrix, while an adaptive threshold is used for merging. In addition, a modified generalized ob- jective function is used for prototype-based fuzzy clustering. The function includes the p-norm distance measure as well as principal components of the clusters. The number of the principal components is determined automatically from the data being clustered. The properties of this unsupervised fuzzy clustering algorithm are illustrated by several experiments.
Unsupervised spectral learning
Shortreed, Susan, Meila, Marina
In spectral clustering and spectral image segmentation, the data is partioned starting from a given matrix of pairwise similarities S. the matrix S is constructed by hand, or learned on a separate training set. In this paper we show how to achieve spectral clustering in unsupervised mode. Our algorithm starts with a set of observed pairwise features, which are possible components of an unknown, parametric similarity function. This function is learned iteratively, at the same time as the clustering of the data. The algorithm shows promosing results on synthetic and real data.
Relational Data Mining Through Extraction of Representative Exemplars
Blanchard, Frรฉdรฉric, Herbin, Michel
With the growing interest on Network Analysis, Relational Data Mining is becoming an emphasized domain of Data Mining. This paper addresses the problem of extracting representative elements from a relational dataset. After defining the notion of degree of representativeness, computed using the Borda aggregation procedure, we present the extraction of exemplars which are the representative elements of the dataset. We use these concepts to build a network on the dataset. We expose the main properties of these notions and we propose two typical applications of our framework. The first application consists in resuming and structuring a set of binary images and the second in mining co-authoring relation in a research team.
Gene Expression Time Course Clustering with Countably Infinite Hidden Markov Models
Beal, Matthew, Krishnamurthy, Praveen
It is said that genes that cluster with similar expression-- that is, are co-expressed--serve similar functional roles in a process (see, for example, Eisen et al. 1998). Bioin-formaticians have more recently had access to sets of time-series measurements of genes' expression over the duration of an experiment, and have desired therefore to learn not just co-expression, but causal relationships that may help elucidate co-regulation as well. Two problematic issues hamper practical methods for clustering gene expression time course data: first, if deriving a model-based clustering metric, it is often unclear what the appropriate model complexity should be; second, the current clustering algorithms available cannot handle, and therefore disregard, the temporal information. This usually occurs when constructing a metric for the distance between any two such genes. The common practice for an experiment having T measurements of a gene's expression over time is to consider the expression as positioned in a T -dimensional space, and to perform (at worse spherical metric) clustering in that space. The result is that the clustering algorithm is invariant to arbitrary permutations of the time points, which is highly undesirable since we would like to take into account the correlations between all the genes' expression at nearby or adjacent time points.
An Iterative Locally Linear Embedding Algorithm
Kong, Deguang, Ding, Chris H. Q., Huang, Heng, Nie, Feiping
Local Linear embedding (LLE) is a popular dimension reduction method. In this paper, we first show LLE with nonnegative constraint is equivalent to the widely used Laplacian embedding. We further propose to iterate the two steps in LLE repeatedly to improve the results. Thirdly, we relax the kNN constraint of LLE and present a sparse similarity learning algorithm. The final Iterative LLE combines these three improvements. Extensive experiment results show that iterative LLE algorithm significantly improve both classification and clustering results.
Small-sample Brain Mapping: Sparse Recovery on Spatially Correlated Designs with Randomization and Clustering
Varoquaux, Gael, Gramfort, Alexandre, Thirion, Bertrand
Functional neuroimaging can measure the brain?s response to an external stimulus. It is used to perform brain mapping: identifying from these observations the brain regions involved. This problem can be cast into a linear supervised learning task where the neuroimaging data are used as predictors for the stimulus. Brain mapping is then seen as a support recovery problem. On functional MRI (fMRI) data, this problem is particularly challenging as i) the number of samples is small due to limited acquisition time and ii) the variables are strongly correlated. We propose to overcome these difficulties using sparse regression models over new variables obtained by clustering of the original variables. The use of randomization techniques, e.g. bootstrap samples, and clustering of the variables improves the recovery properties of sparse methods. We demonstrate the benefit of our approach on an extensive simulation study as well as two fMRI datasets.
Agglomerative Bregman Clustering
Telgarsky, Matus, Dasgupta, Sanjoy
This manuscript develops the theory of agglomerative clustering with Bregman divergences. Geometric smoothing techniques are developed to deal with degenerate clusters. To allow for cluster models based on exponential families with overcomplete representations, Bregman divergences are developed for nondifferentiable convex functions.
Copula Mixture Model for Dependency-seeking Clustering
We introduce a copula mixture model to perform dependency-seeking clustering when co-occurring samples from different data sources are available. The model takes advantage of the great flexibility offered by the copulas framework to extend mixtures of Canonical Correlation Analysis to multivariate data with arbitrary continuous marginal densities. We formulate our model as a non-parametric Bayesian mixture, while providing efficient MCMC inference. Experiments on synthetic and real data demonstrate that the increased flexibility of the copula mixture significantly improves the clustering and the interpretability of the results.
Efficient Active Algorithms for Hierarchical Clustering
Krishnamurthy, Akshay, Balakrishnan, Sivaraman, Xu, Min, Singh, Aarti
Advances in sensing technologies and the growth of the internet have resulted in an explosion in the size of modern datasets, while storage and processing power continue to lag behind. This motivates the need for algorithms that are efficient, both in terms of the number of measurements needed and running time. To combat the challenges associated with large datasets, we propose a general framework for active hierarchical clustering that repeatedly runs an off-the-shelf clustering algorithm on small subsets of the data and comes with guarantees on performance, measurement complexity and runtime complexity. We instantiate this framework with a simple spectral clustering algorithm and provide concrete results on its performance, showing that, under some assumptions, this algorithm recovers all clusters of size ?(log n) using O(n log^2 n) similarities and runs in O(n log^3 n) time for a dataset of n objects. Through extensive experimentation we also demonstrate that this framework is practically alluring.
Groupwise Constrained Reconstruction for Subspace Clustering
Li, Ruijiang, Li, Bin, Zhang, Ke, Jin, Cheng, Xue, Xiangyang
Reconstruction based subspace clustering methods compute a self reconstruction matrix over the samples and use it for spectral clustering to obtain the final clustering result. Their success largely relies on the assumption that the underlying subspaces are independent, which, however, does not always hold in the applications with increasing number of subspaces. In this paper, we propose a novel reconstruction based subspace clustering model without making the subspace independence assumption. In our model, certain properties of the reconstruction matrix are explicitly characterized using the latent cluster indicators, and the affinity matrix used for spectral clustering can be directly built from the posterior of the latent cluster indicators instead of the reconstruction matrix. Experimental results on both synthetic and real-world datasets show that the proposed model can outperform the state-of-the-art methods.