Clustering
Ambiguity-Driven Fuzzy C-Means Clustering: How to Detect Uncertain Clustered Records
Ghaffari, Meysam, Ghadiri, Nasser
As a well-known clustering algorithm, Fuzzy C-Means (FCM) allows each input sample to belong to more than one cluster, providing more flexibility than non-fuzzy clustering methods. However, the accuracy of FCM is subject to false detections caused by noisy records, weak feature selection and low certainty of the algorithm in some cases. The false detections are very important in some decision-making application domains like network security and medical diagnosis, where weak decisions based on such false detections may lead to catastrophic outcomes. They are mainly emerged from making decisions about a subset of records that do not provide enough evidence to make a good decision. In this paper, we propose a method for detecting such ambiguous records in FCM by introducing a certainty factor to decrease invalid detections. This approach enables us to send the detected ambiguous records to another discrimination method for a deeper investigation, thus increasing the accuracy by lowering the error rate. Most of the records are still processed quickly and with low error rate which prevents performance loss compared to similar hybrid methods. Experimental results of applying the proposed method on several datasets from different domains show a significant decrease in error rate as well as improved sensitivity of the algorithm.
Automatic Dimension Selection for a Non-negative Factorization Approach to Clustering Multiple Random Graphs
Lee, Nam H., Wang, I-Jeng, Park, Youngser, Priebe, Care E., Rosen, Michael
We consider a problem of grouping multiple graphs into several clusters using singular value thesholding and non-negative factorization. We derive a model selection information criterion to estimate the number of clusters. We demonstrate our approach using "Swimmer data set" as well as simulated data set, and compare its performance with two standard clustering algorithms.
Spectral Clustering of Graphs with the Bethe Hessian
Saade, Alaa, Krzakala, Florent, Zdeborovรก, Lenka
Recently, it has been argued that using instead a more complicated, non-symmetric and higher dimensional operator, related to the non-backtracking walk on the graph, leads to improved performance in detecting clusters, and even to optimal performance for the stochastic block model. Here, we propose to use instead a simpler object, a symmetric real matrix known as the Bethe Hessian operator, or deformed Laplacian. We show that this approach combines the performances of the non-backtracking operator, thus detecting clusters all the way down to the theoretical limit in the stochastic block model, with the computational, theoretical and memory advantages of real symmetric matrices. Clustering a graph into groups or functional modules (sometimes called communities) is a central task in many fields ranging from machine learning to biology. A common benchmark for this problem is to consider graphs generated by the stochastic block model (SBM) [7, 22]. In this case, one considersn vertices and each of them has a group label g v { 1,...,q} . A graph is then created as follows: all edges are generated independently according to aq q matrix p of probabilities, with Pr[A u,v 1] p g u,g v.
Unsupervised deconvolution of dynamic imaging reveals intratumor vascular heterogeneity
Chen, Li, Choyke, Peter L., Wang, Niya, Clarke, Robert, Bhujwalla, Zaver M., Hillman, Elizabeth M. C., Wang, Yue
Department of Biomedical Engineering, Biomedical Imaging Center, Rensselaer Polytechnic Institute, Troy, NY 12180, USA 2 With the existence of biologically distinctive malignant cells originated within the same tumor, intratumor functional heterogeneity is present in many cancers and is often manifested by the intermingled vascular compartments with distinct pharmacokinetics. However, intratumor vascular heterogeneity cannot be resolved directly by most in vivo dynamic imaging. We developed multi-tissue compartment modeling (MTCM), a completely unsupervised method of deconvoluting dynamic imaging series from heterogeneous tumors that can improve vascular characterization in many biological contexts. Applying MTCM to dynamic contrast-enhanced magnetic resonance imaging of breast cancers revealed characteristic intratumor vascular heterogeneity and therapeutic responses that were otherwise undetectable. MTCM is readily applicable to other dynamic imaging modalities for studying intratumor functional and phenotypic heterogeneity, together with a variety of foreseeable applications in the clinic. A formal mathematical description of the method and its detailed implementation is available in Methods.
Axiomatic Construction of Hierarchical Clustering in Asymmetric Networks
Carlsson, Gunnar, Mรฉmoli, Facundo, Ribeiro, Alejandro, Segarra, Santiago
This paper considers networks where relationships between nodes are represented by directed dissimilarities. The goal is to study methods for the determination of hierarchical clusters, i.e., a family of nested partitions indexed by a connectivity parameter, induced by the given dissimilarity structures. Our construction of hierarchical clustering methods is based on defining admissible methods to be those methods that abide by the axioms of value - nodes in a network with two nodes are clustered together at the maximum of the two dissimilarities between them - and transformation - when dissimilarities are reduced, the network may become more clustered but not less. Several admissible methods are constructed and two particular methods, termed reciprocal and nonreciprocal clustering, are shown to provide upper and lower bounds in the space of admissible methods. Alternative clustering methodologies and axioms are further considered. Allowing the outcome of hierarchical clustering to be asymmetric, so that it matches the asymmetry of the original data, leads to the inception of quasi-clustering methods. The existence of a unique quasi-clustering method is shown. Allowing clustering in a two-node network to proceed at the minimum of the two dissimilarities generates an alternative axiomatic construction. There is a unique clustering method in this case too. The paper also develops algorithms for the computation of hierarchical clusters using matrix powers on a min-max dioid algebra and studies the stability of the methods proposed. We proved that most of the methods introduced in this paper are such that similar networks yield similar hierarchical clustering results. Algorithms are exemplified through their application to networks describing internal migration within states of the United States (U.S.) and the interrelation between sectors of the U.S. economy.
An application of topological graph clustering to protein function prediction
Bowman, R. Sean, Heisterkamp, Douglas, Johnson, Jesse, O'Donnol, Danielle
We use a semisupervised learning algorithm based on a topological data analysis approach to assign functional categories to yeast proteins using similarity graphs. This new approach to analyzing biological networks yields results that are as good as or better than state of the art existing approaches.
A Case Study in Text Mining: Interpreting Twitter Data From World Cup Tweets
Godfrey, Daniel, Johns, Caley, Meyer, Carl, Race, Shaina, Sadek, Carol
Cluster analysis is a field of data analysis that extracts underlying patterns in data. One application of cluster analysis is in text-mining, the analysis of large collections of text to find similarities between documents. We used a collection of about 30,000 tweets extracted from Twitter just before the World Cup started. A common problem with real world text data is the presence of linguistic noise. In our case it would be extraneous tweets that are unrelated to dominant themes. To combat this problem, we created an algorithm that combined the DBSCAN algorithm and a consensus matrix. This way we are left with the tweets that are related to those dominant themes. We then used cluster analysis to find those topics that the tweets describe. We clustered the tweets using k-means, a commonly used clustering algorithm, and Non-Negative Matrix Factorization (NMF) and compared the results. The two algorithms gave similar results, but NMF proved to be faster and provided more easily interpreted results. We explored our results using two visualization tools, Gephi and Wordle.
A Bi-clustering Framework for Consensus Problems
Tepper, Mariano, Sapiro, Guillermo
We consider grouping as a general characterization for problems such as clustering, community detection in networks, and multiple parametric model estimation. We are interested in merging solutions from different grouping algorithms, distilling all their good qualities into a consensus solution. In this paper, we propose a bi-clustering framework and perspective for reaching consensus in such grouping problems. In particular, this is the first time that the task of finding/fitting multiple parametric models to a dataset is formally posed as a consensus problem. We highlight the equivalence of these tasks and establish the connection with the computational Gestalt program, that seeks to provide a psychologically-inspired detection theory for visual events. We also present a simple but powerful bi-clustering algorithm, specially tuned to the nature of the problem we address, though general enough to handle many different instances inscribed within our characterization. The presentation is accompanied with diverse and extensive experimental results in clustering, community detection, and multiple parametric model estimation in image processing applications.
Cluster based RBF Kernel for Support Vector Machines
Czarnecki, Wojciech Marian, Tabor, Jacek
In the classical Gaussian SVM classification we use the feature space projection transforming points to normal distributions with fixed covariance matrices (identity in the standard RBF and the covariance of the whole dataset in Mahalanobis RBF). In this paper we add additional information to Gaussian SVM by considering local geometry-dependent feature space projection. We emphasize that our approach is in fact an algorithm for a construction of the new Gaussian-type kernel. We show that better (compared to standard RBF and Mahalanobis RBF) classification results are obtained in the simple case when the space is preliminary divided by k-means into two sets and points are represented as normal distributions with a covariances calculated according to the dataset partitioning. We call the constructed method C$_k$RBF, where $k$ stands for the amount of clusters used in k-means. We show empirically on nine datasets from UCI repository that C$_2$RBF increases the stability of the grid search (measured as the probability of finding good parameters).
Warped Mixtures for Nonparametric Cluster Shapes
Iwata, Tomoharu, Duvenaud, David, Ghahramani, Zoubin
A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.