Clustering
Spectral Clustering with Imbalanced Data
Qian, Jing, Saligrama, Venkatesh
Spectral clustering is sensitive to how graphs are constructed from data particularly when proximal and imbalanced clusters are present. We show that Ratio-Cut (RCut) or normalized cut (NCut) objectives are not tailored to imbalanced data since they tend to emphasize cut sizes over cut values. We propose a graph partitioning problem that seeks minimum cut partitions under minimum size constraints on partitions to deal with imbalanced data. Our approach parameterizes a family of graphs, by adaptively modulating node degrees on a fixed node set, to yield a set of parameter dependent cuts reflecting varying levels of imbalance. The solution to our problem is then obtained by optimizing over these parameters. We present rigorous limit cut analysis results to justify our approach. We demonstrate the superiority of our method through unsupervised and semi-supervised experiments on synthetic and real data sets.
Variational Bayes Approximations for Clustering via Mixtures of Normal Inverse Gaussian Distributions
Subedi, Sanjeena, McNicholas, Paul D.
The use of mixture models for clustering, referred to as model-based clustering, has become increasingly popular since the work of Wolfe (1963). A wide variety of finite mixture models has been studied extensively within the literature to date. Amongst these, the Gaussian mixture model has received special attention due to its mathematical tractability and the relative computational simplicity associated with parameter estimation. However, the Gaussian mixture model is not without limitations; for instance, the component densities are restricted to being symmetric.
Ensemble approaches for improving community detection methods
Dahlin, Johan, Svenson, Pontus
Statistical estimates can often be improved by fusion of data from several different sources. One example is so-called ensemble methods which have been successfully applied in areas such as machine learning for classification and clustering. In this paper, we present an ensemble method to improve community detection by aggregating the information found in an ensemble of community structures. This ensemble can found by re-sampling methods, multiple runs of a stochastic community detection method, or by several different community detection algorithms applied to the same network. The proposed method is evaluated using random networks with community structures and compared with two commonly used community detection methods. The proposed method when applied on a stochastic community detection algorithm performs well with low computational complexity, thus offering both a new approach to community detection and an additional community detection method.
Inconsistency of Pitman-Yor process mixtures for the number of components
Miller, Jeffrey W., Harrison, Matthew T.
In population genetics, determining the "population structure" is an important step in the analysis of sampled data. As an illustrative example, consider the impala, a species of antelope in southern Africa. Impalas are divided into two subspecies: the common impala occupying much of the eastern half of the region, and the black-faced impala inhabiting a small area in the west. While common impalas are abundant, the number of black-faced impalas has been decimated by drought, poaching, and declining resources due to human and livestock expansion. To assist conservation efforts, Lorenzen, Arctander and Siegismund (2006) collected samples from 216 impalas, and analyzed the genetic variation between/within the two subspecies. A key part of their analysis consisted of inferring the population structure -- that is, partitioning the data into distinct populations, and in particular, determining how many such populations there are. To infer the impala population structure, Lorenzen et al. employed a widely-used tool called Structure (Pritchard, Stephens and Donnelly, 2000) which, in the simplest version, models the data as a finite mixture, with each component in the mixture corresponding to a dis-Supported in part by NSF grant DMS-1007593 and DARPA contract FA8650-11-1-715.
Mixtures of Common Skew-t Factor Analyzers
Murray, Paula M., McNicholas, Paul D., Browne, Ryan P.
A mixture of common skew-t factor analyzers model is introduced for model-based clustering of high-dimensional data. By assuming common component factor loadings, this model allows clustering to be performed in the presence of a large number of mixture components or when the number of dimensions is too large to be well-modelled by the mixtures of factor analyzers model or a variant thereof. Furthermore, assuming that the component densities follow a skew-t distribution allows robust clustering of skewed data. The alternating expectation-conditional maximization algorithm is employed for parameter estimation. We demonstrate excellent clustering performance when our model is applied to real and simulated data.This paper marks the first time that skewed common factors have been used.
The algorithm of noisy k-means
Brunet, Camille, Loustau, Sรฉbastien
In this note, we introduce a new algorithm to deal with finite dimensional clustering with errors in variables. The design of this algorithm is based on recent theoretical advances (see Loustau (2013a,b)) in statistical learning with errors in variables. As the previous mentioned papers, the algorithm mixes different tools from the inverse problem literature and the machine learning community. Coarsely, it is based on a two-step procedure: (1) a deconvolution step to deal with noisy inputs and (2) Newton's iterations as the popular k-means.
Multi-View K-Means Clustering on Big Data
Cai, Xiao (University of Texas at Arlington) | Nie, Feiping (University of Texas at Arlington) | Huang, Heng (University of Texas at Arlington)
In past decade, more and more data are collected from multiple sources or represented by multiple views, where different views describe distinct perspectives of the data. Although each view could be individually used for finding patterns by clustering, the clustering performance could be more accurate by exploring the rich information among multiple views. Several multi-view clustering methods have been proposed to unsupervised integrate different views of data. However, they are graph based approaches, e.g. based on spectral clustering, such that they cannot handle the large-scale data. How to combine these heterogeneous features for unsupervised large-scale data clustering has become a challenging problem. In this paper, we propose a new robust large-scale multi-view clustering method to integrate heterogeneous representations of large-scale data. We evaluate the proposed new methods by six benchmark data sets and compared the performance with several commonly used clustering approaches as well as the baseline multi-view clustering methods. In all experimental results, our proposed methods consistently achieve superiors clustering performances.
Large-Scale Spectral Clustering on Graphs
Liu, Jialu (University of Illinois at Urbana-Champaign) | Wang, Chi (University of Illinois at Urbana-Champaign) | Danilevsky, Marina (University of Illinois at Urbana-Champaign) | Han, Jiawei (University of Illinois at Urbana-Champaign)
Graph clustering has received growing attention in recent years as an important analytical technique, both due to the prevalence of graph data, and the usefulness of graph structures for exploiting intrinsic data characteristics.However, as graph data grows in scale, it becomes increasingly more challenging to identify clusters. In this paper we propose an efficient clustering algorithm for large-scale graph data using spectral methods. The key idea is to repeatedly generate a small number of "supernodes" connected to the regular nodes, in order to compress the original graph into a sparse bipartite graph. By clustering the bipartite graph using spectral methods, we are able to greatly improve efficiency without losing considerable clustering power. Extensive experiments show the effectiveness and efficiency of our approach.
Learning Finite Beta-Liouville Mixture Models via Variational Bayes for Proportional Data Clustering
Fan, Wentao (Concordia University) | Bouguila, Nizar (Concordia University)
During the past decade, finite mixture modeling has become a well-established technique in data analysis and clustering. This paper focus on developing a variational inference framework to learn finite Beta-Liouville mixture models that have been proposed recently as an efficient way for proportional data clustering. In contrast to the conventional expectation maximization (EM) algorithm, commonly used for learning finite mixture models, the proposed algorithm has the advantages that it is more efficient from a computational point of view and by preventing over- and under-fitting problems. Moreover, the complexity of the mixture model (i.e. the number of components) can be determined automatically and simultaneously with the parameters estimation in a closed form as part of the Bayesian inference procedure. The merits of the proposed approach are shown using both artificial data sets and two interesting and challenging real applications namely dynamic textures clustering and facial expression recognition.
Algorithm Portfolios Based on Cost-Sensitive Hierarchical Clustering
Malitsky, Yuri (Cork Constraint Computation Centre) | Sabharwal, Ashish (IBM Watson Research Center) | Samulowitz, Horst (IBM Watson Research Center) | Sellmann, Meinolf (IBM Watson Research Center)
Different solution approaches for combinatorial problems often exhibit incomparable performance that depends on the concrete problem instanceto be solved. Algorithm portfolios aim to combine the strengths of multiple algorithmic approaches by training a classifier that selects or schedules solvers dependent on the given instance. We devise a new classifier that selects solvers based on a cost-sensitive hierarchical clustering model. Experimental results on SAT and MaxSAT show that the new method outperforms the most effective portfolio builders to date.