Goto

Collaborating Authors

 Clustering


Generalized Cluster Aggregation

AAAI Conferences

Clustering aggregation has emerged as an important extension of the classical clustering problem. It refers to the situation in which a number of different (input) clusterings have been obtained for a particular data set and it is desired to aggregate those clustering results to get a better clustering solution. In this paper, we propose a unified framework to solve the clustering aggregation problem, where the aggregated clustering result is obtained by minimizing the (weighted) sum of the Bregman divergence between it and all the input clusterings. Moreover, under our algorithm framework, we also propose a novel cluster aggregation problem where some must-link and cannot-link constraints are given in addition to the input clusterings. Finally the experimental results on some real world data sets are presented to show the effectiveness of our method.


Spectral Embedded Clustering

AAAI Conferences

In this paper, we propose a new spectral clustering method, referred to as Spectral Embedded Clustering (SEC), to minimize the normalized cut criterion in spectral clustering as well as control the mismatch between the cluster assignment matrix and the low dimensional embedded representation of the data. SEC is based on the observation that the cluster assignment matrix of high dimensional data can be represented by a low dimensional linear mapping of data. We also discover the connection between SEC and other clustering methods, such as spectral clustering, Clustering with local and global regularization, K-means and Discriminative K-means. The experiments on many real-world data sets show that SEC significantly outperforms the existing spectral clustering methods as well as K-means clustering related methods.


Local Learning Regularized Nonnegative Matrix Factorization

AAAI Conferences

Nonnegative Matrix Factorization (NMF) has been widely used in machine learning and data mining. It aims to find two nonnegative matrices whose product can well approximate the nonnegative data matrix, which naturally lead to parts-based representation. In this paper, we present a local learning regularized nonnegative matrix factorization (LLNMF) for clustering. It imposes an additional constraint on NMF that the cluster label of each point can be predicted by the points in its neighborhood. This constraint encodes both the discriminative information and the geometric structure, and is good at clustering data on manifold. An iterative multiplicative updating algorithm is proposed to optimize the objective, and its convergence is guaranteed theoretically. Experiments on many benchmark data sets demonstrate that the proposed method outperforms NMF as well as many state of the art clustering methods.


Knowledge Driven Dimension Reduction For Clustering

AAAI Conferences

However, most dimension reduction approaches are driven by objective functions that may not or only We will provide more detail on our solution to this problem partially suit the end users requirements. In this later but it is important to note the problem of focus in this work, we show how to incorporate general-purpose paper is different to spectral clustering (dimension reduction) domain expertise encoded as a graph into dimension in two keys ways. Firstly, we are projecting the entire space reduction in way that lends itself to an elegant D occupies not just the points in G or D. Secondly, we do generalized eigenvalue problem. We call not formulate the problem as some form of min-cut and then our approach Graph-Driven Constrained Dimension solve a relaxed version of the problem. Reduction via Linear Projection (GCDR-LP) Our work aims to find a reduced dimension space based on and show that it has several desirable properties.


Locality Preserving Nonnegative Matrix Factorization

AAAI Conferences

Matrix factorization techniques have been frequently applied in information processing tasks. Among them, Non-negative Matrix Factorization (NMF) have received considerable attentions due to its psychological and physiological interpretation of naturally occurring data whose representation may be parts-based in human brain. On the other hand, from geometric perspective the data is usually sampled from a low dimensional manifold embedded in high dimensional ambient space. One hopes then to find a compact representation which uncovers the hidden topics and simultaneously respects the intrinsic geometric structure. In this paper, we propose a novel algorithm, called {\em Locality Preserving Non-negative Matrix Factorization} (LPNMF), for this purpose. For two data points, we use KL-divergence to evaluate their similarity on the hidden topics. The optimal maps are obtained such that the feature values on hidden topics are restricted to be non-negative and vary smoothly along the geodesics of the data manifold. Our empirical study shows the encouraging results of the proposed algorithm in comparisons to the state-of-the-art algorithms on two large high-dimensional databases.


Adaptive Cluster Ensemble Selection

AAAI Conferences

Cluster ensembles generate a large number of different clustering solutions and combine them into a more robust and accurate consensus clustering. On forming the ensembles, the literature has suggested that higher diversity among ensemble members produces higher performance gain. In contrast, some studies also indicated that medium diversity leads to the best performing ensembles. Such contradicting observations suggest that different data, with varying characteristics, may require different treatments. We empirically investigate this issue by examining the behavior of cluster ensembles on benchmark data sets. This leads to a novel framework that selects ensemble members for each data set based on its own characteristics. Our framework first generates a diverse set of solutions and combines them into a consensus partition P*. Based on the diversity between the ensemble members and P*, a subset of ensemble members is selected and combined to obtain the final output. We evaluate the proposed method on benchmark data sets and the results show that the proposed method can significantly improve the clustering performance, often by a substantial margin. In some cases, we were able to produce final solutions that significantly outperform even the best ensemble members.


Set Branching in Constraint Optimization

AAAI Conferences

Branch and bound is an effective technique for solving constraint optimization problems (COP’s). However, its search space expands very rapidly as the domain sizes of the problem variables grow. In this paper, we present an algorithm that clusters the values of a variable’s domain into sets. Branch and bound can then branch on these sets of values rather than on individual values, thereby reducing the branching factor of its search space. The aim of our clustering algorithm is to construct a collection of sets such that branching on these sets will still allow effective bounding. In conjunction with the reduced branching factor, the size of the explored search space is thus significantly reduced. We test our method and show empirically that it can yield significant performance gains over existing stateof- the-art techniques.


Symmetry in Data Mining and Analysis: A Unifying View based on Hierarchy

arXiv.org Machine Learning

Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. "Structure" has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Beginning with the role of number theory in expressing data, we show how we can naturally proceed to hierarchical structures. We show how this both encapsulates traditional paradigms in data analysis, and also opens up new perspectives towards issues that are on the order of the day, including data mining of massive, high dimensional, heterogeneous data sets. Linkages with other fields are also discussed including computational logic and symbolic dynamics. The structures in data surveyed here are based on hierarchy, represented as p-adic numbers or an ultrametric topology.


Complex Question Answering: Unsupervised Learning Approaches and Experiments

Journal of Artificial Intelligence Research

Complex questions that require inferencing and synthesizing information from multiple documents can be seen as a kind of topic-oriented, informative multi-document summarization where the goal is to produce a single text as a compressed version of a set of documents with a minimum loss of relevant information. In this paper, we experiment with one empirical method and two unsupervised statistical machine learning techniques: K-means and Expectation Maximization (EM), for computing relative importance of the sentences. We compare the results of these approaches. Our experiments show that the empirical approach outperforms the other two techniques and EM performs better than K-means. However, the performance of these approaches depends entirely on the feature set used and the weighting of these features. In order to measure the importance and relevance to the user query we extract different kinds of features (i.e. lexical, lexical semantic, cosine similarity, basic element, tree kernel based syntactic and shallow-semantic) for each of the document sentences. We use a local search technique to learn the weights of the features. To the best of our knowledge, no study has used tree kernel functions to encode syntactic/semantic information for more complex tasks such as computing the relatedness between the query sentences and the document sentences in order to generate query-focused summaries (or answers to complex questions). For each of our methods of generating summaries (i.e. empirical, K-means and EM) we show the effects of syntactic and shallow-semantic features over the bag-of-words (BOW) features.


Mining Default Rules from Statistical Data

AAAI Conferences

In this paper, we are interested in the qualitative knowledge that underlies some given probabilistic information. To represent such qualitative structures, we use ordinal conditional functions, OCFs, (or ranking functions) as a qualitative abstraction of probability functions. The basic idea for transforming probabilities into ordinal rankings is to find well-behaved clusterings of the negative logarithms of the probabilities. We show how popular clustering tools can be used for this, and propose measures for the evaluation of the clustering results in this context. From the so obtained ranking functions, we extract conditionals that may serve as a base for inductive default reasoning.