Clustering
Multitask Bregman Clustering
Zhang, Jianwen (Tsinghua University) | Zhang, Changshui (Tsinghua University)
Traditional clustering methods deal with a single clustering task on a single data set. However, in some newly emerging applications, multiple similar clustering tasks are involved simultaneously. In this case, we not only desire a partition for each task, but also want to discover the relationship among clusters of different tasks. It's also expected that the learnt relationship among tasks can improve performance of each single task. In this paper, we propose a general framework for this problem and further suggest a specific approach. In our approach, we alternatively update clusters and learn relationship between clusters of different tasks, and the two phases boost each other. Our approach is based on the general Bregman divergence, hence it's suitable for a large family of assumptions on data distributions and divergences. Empirical results on several benchmark data sets validate the approach.
Constrained Coclustering for Textual Documents
Song, Yangqiu (IBM Research - China) | Pan, Shimei (IBM T. J. Watson Research Center) | Liu, Shixia (IBM Research - China) | Wei, Furu (IBM Research - China) | Zhou, Michelle X. (IBM Research - Almaden Center) | Qian, Weihong (IBM Research - China)
In this paper, we present a constrained co-clustering approach for clustering textual documents. Our approach combines the benefits of information-theoretic co-clustering and constrained clustering. We use a two-sided hidden Markov random field (HMRF) to model both the document and word constraints. We also develop an alternating expectation maximization (EM) algorithm to optimize the constrained co-clustering model. We have conducted two sets of experiments on a benchmark data set: (1) using human-provided category labels to derive document and word constraints for semi-supervised document clustering, and (2) using automatically extracted named entities to derive document constraints for unsupervised document clustering. Compared to several representative constrained clustering and co-clustering approaches, our approach is shown to be more effective for high-dimensional, sparse text data.
Non-Negative Matrix Factorization Clustering on Multiple Manifolds
Shen, Bin (Purdue University) | Si, Luo (Purdue University)
Nonnegative Matrix Factorization (NMF) is a widely used technique in many applications such as clustering. It approximates the nonnegative data in an original high dimensional space with a linear representation in a low dimensional space by using the product of two nonnegative matrices. In many applications with data such as human faces or digits, data often reside on multiple manifolds, which may overlap or intersect. But the traditional NMF method and other existing variants of NMF do not consider this. This paper proposes a novel clustering algorithm that explicitly models the intrinsic geometrical structure of the data on multiple manifolds with NMF. The idea of the proposed algorithm is that a data point generated by several neighboring points on a specific manifold in the original space should be constructed in a similar way in the low dimensional subspace. A set of experimental results on two real world datasets demonstrate the advantage of the proposed algorithm.
Gaussian Mixture Model with Local Consistency
Liu, Jialu (Zhejiang University) | Cai, Deng (Zhejiang University) | He, Xiaofei (Zhejiang University)
Gaussian Mixture Model (GMM) is one of the most popular data clustering methods which can be viewed as a linear combination of different Gaussian components. In GMM, each cluster obeys Gaussian distribution and the task of clustering is to group observations into different components through estimating each cluster's own parameters. The Expectation-Maximization algorithm is always involved in such estimation problem. However, many previous studies have shown naturally occurring data may reside on or close to an underlying submanifold. In this paper, we consider the case where the probability distribution is supported on a submanifold of the ambient space. We take into account the smoothness of the conditional probability distribution along the geodesics of data manifold. That is, if two observations are close in intrinsic geometry, their distributions over different Gaussian components are similar. Simply speaking, we introduce a novel method based on manifold structure for data clustering, called Locally Consistent Gaussian Mixture Model (LCGMM). Specifically, we construct a nearest neighbor graph and adopt Kullback-Leibler Divergence as the distance measurement to regularize the objective function of GMM. Experiments on several data sets demonstrate the effectiveness of such regularization.
A Topic Model for Linked Documents and Update Rules for its Estimation
Guo, Zhen (State University of New York at Binghamton) | Zhu, Shenghuo (NEC Laboratories America, Inc.) | Zhang, Zhongfei (State University of New York at Binghamton) | Chi, Yun (NEC Laboratories America, Inc.) | Gong, Yihong (NEC Laboratories America, Inc.)
The latent topic model plays an important role in the unsupervised learning from a corpus, which provides a probabilistic interpretation of the corpus in terms of the latent topic space. An underpinning assumption which most of the topic models are based on is that the documents are assumed to be independent of each other. However, this assumption does not hold true in reality and the relations among the documents are available in different ways, such as the citation relations among the research papers. To address this limitation, in this paper we present a Bernoulli Process Topic (BPT) model, where the interdependence among the documents is modeled by a random Bernoulli process. In the BPT model a document is modeled as a distribution over topics that is a mixture of the distributions associated with the related documents. Although BPT aims at obtaining a better document modeling by incorporating the relations among the documents, it could also be applied to many applications including detecting the topics from corpora and clustering the documents. We apply the BPT model to several document collections and the experimental comparisons against several state-of-the-art approaches demonstrate the promising performance.
Exact Algorithms and Experiments for Hierarchical Tree Clustering
Hartung, Sepp (University of Jena) | Guo, Jiong (Universität des Saarlandes) | Komusiewicz, Christian (University of Jena) | Niedermeier, Rolf (University of Jena) | Uhlmann, Johannes (University of Jena)
We perform new theoretical as well as first-time experimental studies for the NP-hard problem to find a closest ultrametric for given dissimilarity data on pairs. This is a central problem in the area of hierarchical clustering, where so far only polynomial-time approximation algorithms were known. In contrast, we develop efficient preprocessing algorithms (known as kernelization in parameterized algorithmics) with provable performance guarantees and a simple search tree algorithm. These are used to find optimal solutions. Our experiments with synthetic and biological data show the effectiveness of our algorithms and demonstrate that an approximation algorithm due to Ailon and Charikar [FOCS 2005] often gives (almost) optimal solutions.
Clustering Stability: An Overview
A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this paper we give a high-level overview about the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications.
Euclidean Distances, soft and spectral Clustering on Weighted Graphs
We define a class of Euclidean distances on weighted graphs, enabling to perform thermodynamic soft graph clustering. The class can be constructed form the "raw coordinates" encountered in spectral clustering, and can be extended by means of higher-dimensional embeddings (Schoenberg transformations). Geographical flow data, properly conditioned, illustrate the procedure as well as visualization aspects.
An Analysis of Current Trends in CBR Research Using Multi-View Clustering
Greene, Derek (University College Dublin) | Freyne, Jill (CSIRO) | Smyth, Barry (University College Dublin) | Cunningham, Pádraig (University College Dublin)
The European Conference on Case-Based Reasoning (CBR) in 2008 marked 15 years of international and European CBR conferences where almost seven hundred research papers were published. In this report we review the research themes covered in these papers and identify the topics that are active at the moment. The main mechanism for this analysis is a clustering of the research papers based on both co-citation links and text similarity. It is interesting to note that the core set of papers has attracted citations from almost three thousand papers outside the conference collection so it is clear that the CBR conferences are a sub-part of a much larger whole. It is remarkable that the research themes revealed by this analysis do not map directly to the sub-topics of CBR that might appear in a textbook. Instead they reflect the applications-oriented focus of CBR research, and cover the promising application areas and research challenges that are faced.
Segmentation and Nodal Points in Narrative: Study of Multiple Variations of a Ballad
The Lady Maisry ballads afford us a framework within which to segment a storyline into its major components. Segments and as a consequence nodal points are discussed for nine different variants of the Lady Maisry story of a (young) woman being burnt to death by her family, on account of her becoming pregnant by a foreign personage. We motivate the importance of nodal points in textual and literary analysis. We show too how the openings of the nine variants can be analyzed comparatively, and also the conclusions of the ballads.