Clustering
Clustering Dynamic Spatio-Temporal Patterns in The Presence of Noise and Missing Data
Chen, Xi (University of Minnesota) | Faghmous, James H. (University of Minnesota and Mt. Sinai School of Medicine) | Khandelwal, Ankush (University of Minnesota) | Kumar, Vipin (University of Minnesota)
Clustering has gained widespread use, especially for static data. However, the rapid growth of spatio-temporal data from numerous instruments, such as earth-orbiting satellites, has created a need for spatio-temporal clustering methods to extract and monitor dynamic clusters. Dynamic spatio-temporal clustering faces two major challenges: First, the clusters are dynamic and may change in size, shape, and statistical properties over time. Second, numerous spatio-temporal data are incomplete, noisy, heterogeneous, and highly variable (over space and time). We propose a new spatio-temporal data mining paradigm, to autonomously identify dynamic spatio-temporal clusters in the presence of noise and missing data. Our proposed approach is more robust than traditional clustering and image segmentation techniques in the case of dynamic patterns, non-stationary, heterogeneity, and missing data. We demonstrate our method's performance on a real-world application of monitoring in-land water bodies on a global scale.
VRCA: A Clustering Algorithm for Massive Amount of Texts
Liu, Ming (Harbin Institute of Technology) | Chen, Lei (Beijing Normal University, Zhuhai) | Liu, Bingquan (Harbin Institute of Technology) | Wang, Xiaolong (Harbin Institute of Technology)
There are lots of texts appearing in the web every day. This fact enables the amount of texts in the web to explode. Therefore, how to deal with large-scale text collection becomes more and more important. Clustering is a generally acceptable solution for text organization. Via its unsupervised characteristic, users can easily dig the useful information that they desired. However, traditional clustering algorithms can only deal with small-scale text collection. When it enlarges, they lose their performances. The main reason attributes to the high-dimensional vectors generated from texts. Therefore, to cluster texts in large amount, this paper proposes a novel clustering algorithm, where only the features that can represent cluster are preserved in clusterโs vector. In this algorithm, clustering process is separated into two parts. In one part, featureโs weight is fine-tuned to make cluster partition meet an optimization function. In the other part, features are reordered and only the useful features that can represent cluster are kept in clusterโs vector. Experimental results demonstrate that our algorithm obtains high performance on both small-scale and large-scale text collections.
Sampling with Minimum Sum of Squared Similarities for Nystrom-Based Large Scale Spectral Clustering
Bouneffouf, Djallel (Canada's Michael Smith Genome Sciences Centre) | Birol, Inanc (Canada's Michael Smith Genome Sciences Centre)
The Nystrom method provides an efficient sampling approach for large scale clustering problems, by generating a low-rank matrix approximation. However, existing sampling methods are limited by accuracy and computing time. This paper proposes an improved Nystrom-based clustering algorithm with a new sampling procedure, Minimum Sum of Squared Similarities (MSSS). Experiments on synthetic and real data sets show that the proposed sampling performs with higher accuracy than existing algorithms, applied to Nystrom-based spectral clustering problems. Furthermore, we provide a theoretical analysis that allows us to define the upper bound of the Frobenius norm error of the MSSS.
Generalized Transitive Distance with Minimum Spanning Random Forest
Yu, Zhiding (Carnegie Mellon University) | Liu, Weiyang (Peking University) | Liu, Wenbo (Carnegie Mellon University) | Peng, Xi (Research Agency for Science, Technology and Research (A*STAR) Singapore) | Hui, Zhuo (Carnegie Mellon University) | Kumar, B. V. K. Vijaya (Carnegie Mellon University)
Transitive distance is an ultrametric with elegant properties for clustering. Conventional transitive distance can be found by referring to the minimum spanning tree (MST). We show that such distance metric can be generalized onto a minimum spanning random forest (MSRF) with element-wise max pooling over the set of transitive distance matrices from an MSRF. Our proposed approach is both intuitively reasonable and theoretically attractive. Intuitively, max pooling alleviates undesired short links with single MST when noise is present. Theoretically, one can see that the distance metric obtained max pooling is still an ultrametric, rendering many good clustering properties. Comprehensive experiments on data clustering and image segmentation show that MSRF with max pooling improves the clustering performance over single MST and achieves state of the art performance on the Berkeley Segmentation Dataset.
Face Clustering in Videos with Proportion Prior
Tang, Zhiqiang (Chinese Academy of Sciences) | Zhang, Yifan (Chinese Academy of Sciences) | Li, Zechao (Nanjing University of Science and Technology) | Lu, Hanqing (Chinese Academy of Sciences)
In this paper, we investigate the problem of face clustering in real-world videos. In many cases, the distribution of the face data is unbalanced. In movies or TV series videos, the leading casts appear quite often and the others appear much less. However, many clustering algorithms cannot well handle such severe unbalance between the data distribution, resulting in that the large class is split apart, and the small class is merged into the large ones and thus missing. On the other hand, the data distribution proportion information may be known beforehand. For example, we can obtain such information by counting the spoken lines of the characters in the script text. Hence, we propose to make use of the proportion prior to regularize the clustering. A Hidden Conditional Random Field(HCRF) model is presented to incorporate the proportion prior. In experiments on a public data set from real-world videos, we observe improvements on clustering performance against state-of-the-art methods.
Characterization of Scoring Rules with Distances: Application to the Clustering of Rankings
Viappiani, Paolo (Sorbonne Universities and CNRS)
Positional scoring rules are often used for rank aggregation. In this work we study how scoring rules can be formulated as the minimization of some distance measures between rankings, and we also consider a new family of aggregation methods, called biased scoring rules. This work extends a previous known observation connecting Borda count with the minimization of the sum of the Spearman distances (calculated with respect to a set of input rankings). In particular we consider generalizations of the Spearman distance that can give different weights to items and positions; we also handle the case of incomplete rank data. This has applications in the clustering of rank data, where two main steps need to be performed: aggregating rankings of the same cluster into a representative ranking (the cluster's centroid) and assigning each ranking to its closest centroid. Using the proper combination of scoring rules (for aggregation) and distances (for assignment), it is possible to perform clustering in a computationally efficient way and as well account for specific desired behaviors (give more weight to top positions, bias the centroids in favor of particular items).
Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering
Eldridge, Justin, Belkin, Mikhail, Wang, Yusu
Hierarchical clustering is a popular method for analyzing data which associates a tree to a dataset. Hartigan consistency has been used extensively as a framework to analyze such clustering algorithms from a statistical point of view. Still, as we show in the paper, a tree which is Hartigan consistent with a given density can look very different than the correct limit tree. Specifically, Hartigan consistency permits two types of undesirable configurations which we term over-segmentation and improper nesting. Moreover, Hartigan consistency is a limit property and does not directly quantify difference between trees. In this paper we identify two limit properties, separation and minimality, which address both over-segmentation and improper nesting and together imply (but are not implied by) Hartigan consistency. We proceed to introduce a merge distortion metric between hierarchical clusterings and show that convergence in our distance implies both separation and minimality. We also prove that uniform separation and minimality imply convergence in the merge distortion metric. Furthermore, we show that our merge distortion metric is stable under perturbations of the density. Finally, we demonstrate applicability of these concepts by proving convergence results for two clustering algorithms. First, we show convergence (and hence separation and minimality) of the recent robust single linkage algorithm of Chaudhuri and Dasgupta (2010). Second, we provide convergence results on manifolds for topological split tree clustering.
Analysis of Microarray Data using Artificial Intelligence Based Techniques
The bioinformatics is an interdisciplinary area of study where one of the objectives is to deal with the analysis and interpretation of large sets of data generated from various large-scale biological experiments. The example of one such large-scale biological experiment is measuring the expression levels of tens of thousands of genes simultaneously under some environmental condition. Microarray is one of the essential technologies used by the biologist to measure genome-wide expression levels of genes in a particular organism. As microarrays technologies have become more prevalent, the challenges 1 associated with collecting, managing, and analyzing the data from each experiment have essentially increased. Robust laboratory protocols, improved understanding of the complex experimental design and falling prices of commercial platforms, all these have combined to drive the field to more complex experiments, generating huge amounts of data (Brazma and Vilo, 2000).
LogDet Rank Minimization with Application to Subspace Clustering
Kang, Zhao, Peng, Chong, Cheng, Jie, Chen, Qiang
Low-rank matrix is desired in many machine learning and computer vision problems. Most of the recent studies use the nuclear norm as a convex surrogate of the rank operator. However, all singular values are simply added together by the nuclear norm, and thus the rank may not be well approximated in practical problems. In this paper, we propose to use a log-determinant (LogDet) function as a smooth and closer, though non-convex, approximation to rank for obtaining a low-rank representation in subspace clustering. Augmented Lagrange multipliers strategy is applied to iteratively optimize the LogDet-based non-convex objective function on potentially large-scale data. By making use of the angular information of principal directions of the resultant low-rank representation, an affinity graph matrix is constructed for spectral clustering. Experimental results on motion segmentation and face clustering data demonstrate that the proposed method often outperforms state-of-the-art subspace clustering algorithms.
Integrative analysis of gene expression and phenotype data
The linking genotype to phenotype is the fundamental aim of modern genetics. We focus on study of links between gene expression data and phenotype data through integrative analysis. We propose three approaches. 1) The inherent complexity of phenotypes makes high-throughput phenotype profiling a very difficult and laborious process. We propose a method of automated multi-dimensional profiling which uses gene expression similarity. Large-scale analysis show that our method can provide robust profiling that reveals different phenotypic aspects of samples. This profiling technique is also capable of interpolation and extrapolation beyond the phenotype information given in training data. It can be used in many applications, including facilitating experimental design and detecting confounding factors. 2) Phenotype association analysis problems are complicated by small sample size and high dimensionality. Consequently, phenotype-associated gene subsets obtained from training data are very sensitive to selection of training samples, and the constructed sample phenotype classifiers tend to have poor generalization properties. To eliminate these obstacles, we propose a novel approach that generates sequences of increasingly discriminative gene cluster combinations. Our experiments on both simulated and real datasets show robust and accurate classification performance. 3) Many complex phenotypes, such as cancer, are the product of not only gene expression, but also gene interaction. We propose an integrative approach to find gene network modules that activate under different phenotype conditions. Using our method, we discovered cancer subtype-specific network modules, as well as the ways in which these modules coordinate. In particular, we detected a breast-cancer specific tumor suppressor network module with a hub gene, PDGFRL, which may play an important role in this module.