 Clustering


C-Link: A Hierarchical Clustering Approach to Large-scale Near-optimal Coalition Formation

AAAI Conferences

Coalition formation is a fundamental approach to multi-agent coordination. In this paper we address the specific problem of coalition structure generation, and focus on providing good-enough solutions using a novel heuristic approach that is based on data clustering methods. In particular, we propose a hierarchical agglomerative clustering approach (C-Link), which uses a similarity criterion between coalitions based on the gain that the system achieves if two coalitions merge. We empirically evaluate C-Link on a synthetic benchmark data-set as well as in collective energy purchasing settings. Our results show that the C-Link approach performs very well against an optimal benchmark based on Mixed-Integer Programming, achieving solutions which are in the worst case about 80% of the optimal (in the synthetic data-set), and 98% of the optimal (in the energy data-set). We also show that C-Link can return solutions for problems involving thousands of agents within minutes.
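The merge criterion described in this abstract can be illustrated with a minimal sketch. It is not the authors' C-Link implementation: the characteristic function `toy_value`, the function name `greedy_merge`, and the rule of stopping once no merge yields a positive gain are all assumptions made for illustration.

```python
from itertools import combinations


def greedy_merge(agents, value):
    """Greedy agglomerative coalition formation (illustrative sketch).

    `value` is a hypothetical characteristic function that maps a
    frozenset of agents to a number. Starting from singletons, the pair
    of coalitions with the largest positive merge gain is merged until
    no merge improves the total value (an assumed stopping rule).
    """
    coalitions = [frozenset([a]) for a in agents]
    while len(coalitions) > 1:
        best_gain, best_pair = 0.0, None
        for c1, c2 in combinations(coalitions, 2):
            # Gain the system achieves if c1 and c2 merge.
            gain = value(c1 | c2) - value(c1) - value(c2)
            if gain > best_gain:
                best_gain, best_pair = gain, (c1, c2)
        if best_pair is None:
            break  # no merge has positive gain
        c1, c2 = best_pair
        coalitions = [c for c in coalitions if c not in (c1, c2)]
        coalitions.append(c1 | c2)
    return coalitions


def toy_value(coalition):
    """Hypothetical value function: pairs are rewarded, larger groups are not."""
    size = len(coalition)
    return size * size if size <= 2 else 0


structure = greedy_merge([1, 2, 3, 4], toy_value)
print(sorted(sorted(c) for c in structure))
```

With this toy value function, the four agents end up in two coalitions of two agents each, because any further merge would reduce the total value.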


DeBaCl: A Python Package for Interactive DEnsity-BAsed CLustering

arXiv.org Machine Learning

The level set tree approach of Hartigan (1975) provides a probabilistically based and highly interpretable encoding of the clustering behavior of a dataset. By representing the hierarchy of data modes as a dendrogram of the level sets of a density estimator, this approach offers many advantages for exploratory analysis and clustering, especially for complex and high-dimensional data. Several R packages exist for level set tree estimation, but their practical usefulness is limited by computational inefficiency, absence of interactive graphical capabilities and, from a theoretical perspective, reliance on asymptotic approximations. To make it easier for practitioners to capture the advantages of level set trees, we have written the Python package DeBaCl for DEnsity-BAsed CLustering. In this article we illustrate how DeBaCl's level set tree estimates can be used for difficult clustering tasks and interactive graphical data analysis. The package is intended to promote the practical use of level set trees through improvements in computational efficiency and a high degree of user customization. In addition, the flexible algorithms implemented in DeBaCl enjoy finite sample accuracy, as demonstrated in recent literature on density clustering. Finally, we show the level set tree framework can be easily extended to deal with functional data. Keywords: density-based clustering, level set tree, Python, interactive graphics, functional data analysis.


Cluster Trees on Manifolds

arXiv.org Machine Learning

In this paper we investigate the problem of estimating the cluster tree for a density $f$ supported on or near a smooth $d$-dimensional manifold $M$ isometrically embedded in $\mathbb{R}^D$. We analyze a modified version of a $k$-nearest neighbor based algorithm recently proposed by Chaudhuri and Dasgupta. The main results of this paper show that under mild assumptions on $f$ and $M$, we obtain rates of convergence that depend on $d$ only but not on the ambient dimension $D$. We also show that similar (albeit non-algorithmic) results can be obtained for kernel density estimators. We sketch a construction of a sample complexity lower bound instance for a natural class of manifold oblivious clustering algorithms. We further briefly consider the known manifold case and show that in this case a spatially adaptive algorithm achieves better rates.


Performance comparison of State-of-the-art Missing Value Imputation Algorithms on Some Benchmark Datasets

arXiv.org Machine Learning

The presence of missing values influences the selection of an appropriate set of attributes, which in turn degrades the classification accuracy of classifiers. Missing values are a common problem in almost all real-world data sets [1] used in knowledge discovery and data mining (KDD) applications. They are especially frequent in clinical databases [2, 3, 4] and temporal climate databases [5, 6]. Their presence can greatly affect the performance of classifiers [7]. Missing values may arise for various reasons, such as the value being lost (erased or deleted) or not recorded, incorrect measurements, equipment errors, or an expert not attaching any importance to a particular procedure. Incomplete data can often be identified by looking for null values in the data set. However, this does not always work, since missing values can appear in the form of outliers or even wrong data (i.e.


The Cluster Graphical Lasso for improved estimation of Gaussian graphical models

arXiv.org Machine Learning

We consider the task of estimating a Gaussian graphical model in the high-dimensional setting. The graphical lasso, which involves maximizing the Gaussian log likelihood subject to an l1 penalty, is a well-studied approach for this task. We begin by introducing a surprising connection between the graphical lasso and hierarchical clustering: the graphical lasso in effect performs a two-step procedure, in which (1) single linkage hierarchical clustering is performed on the variables in order to identify connected components, and then (2) an l1-penalized log likelihood is maximized on the subset of variables within each connected component. In other words, the graphical lasso determines the connected components of the estimated network via single linkage clustering. Unfortunately, single linkage clustering is known to perform poorly in certain settings. Therefore, we propose the cluster graphical lasso, which involves clustering the features using an alternative to single linkage clustering, and then performing the graphical lasso on the subset of variables within each cluster. We establish model selection consistency for this technique, and demonstrate its improved performance relative to the graphical lasso in a simulation study, as well as in applications to an equities data set, a university webpage data set, and a gene expression data set.
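The first step of the two-step procedure described in this abstract can be sketched as follows. The sketch assumes that the similarity between two variables is the absolute value of their sample covariance and that the single linkage dendrogram is cut at the penalty level `lam`; neither detail is stated in the abstract. Under those assumptions, the clusters are the connected components of the graph that links variables `i` and `j` whenever `|S[i][j]| > lam`. The function name `thresholded_components` is hypothetical.

```python
def thresholded_components(S, lam):
    """Connected components of the graph with an edge where |S[i][j]| > lam.

    S is a symmetric covariance matrix given as a list of lists.
    Cutting a single linkage dendrogram at similarity level lam yields
    the same groups, so a union-find pass is enough for this sketch.
    """
    p = len(S)
    parent = list(range(p))

    def find(i):
        # Find the root of i, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(S[i][j]) > lam:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    groups = {}
    for i in range(p):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


S = [
    [1.00, 0.80, 0.05],
    [0.80, 1.00, 0.10],
    [0.05, 0.10, 1.00],
]
print(thresholded_components(S, 0.5))
```

The second step, fitting a penalized model within each component, is not shown here.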


Fast Algorithm for Modularity-Based Graph Clustering

AAAI Conferences

In AI and Web communities, modularity-based graph clustering algorithms are being applied to various applications. However, existing algorithms are not applied to large graphs because they have to scan all vertices/edges iteratively. The goal of this paper is to efficiently compute clusters with high modularity from extremely large graphs with more than a few billion edges. The heart of our solution is to compute clusters by incrementally pruning unnecessary vertices/edges and optimizing the order of vertex selections. Our experiments show that our proposal outperforms all other modularity-based algorithms in terms of computation time, and it finds clusters with high modularity.
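For readers unfamiliar with the objective being optimized, here is a minimal sketch of the standard (Newman) modularity score for a given partition. It assumes an undirected, unweighted graph given as an edge list and does not reproduce the pruning or vertex-ordering techniques of the paper. The function name `modularity` and the toy graph are illustrative.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges:     list of (u, v) pairs, each undirected edge listed once
    community: dict mapping each vertex to its community label

    Q = sum over communities c of  e_c / m - (d_c / (2 m)) ** 2,
    where m is the number of edges, e_c the number of edges inside c,
    and d_c the sum of the degrees of the vertices in c.
    """
    m = len(edges)
    internal = {}  # e_c per community
    degree = {}    # d_c per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, split))
```

Placing each triangle in its own community gives Q = 5/14, whereas putting all six vertices in one community gives Q = 0.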


Towards Cohesive Anomaly Mining

AAAI Conferences

In some applications, such as bioinformatics, social network analysis, and computational criminology, it is desirable to find compact clusters formed by a (very) small portion of objects in a large data set. Since such clusters are comprised of a small number of objects, they are extraordinary and anomalous with respect to the entire data set. This specific type of clustering task cannot be solved well by the conventional clustering methods since generally those methods try to assign most of the data objects into clusters. In this paper, we model this novel and application-inspired task as the problem of mining cohesive anomalies. We propose a general framework and a principled approach to tackle the problem. The experimental results on both synthetic and real data sets verify the effectiveness and efficiency of our approach.


Spectral Rotation versus K-Means in Spectral Clustering

AAAI Conferences

Spectral clustering has been a popular data clustering algorithm. This category of approaches often resorts to other clustering methods, such as K-Means, to get the final clusters. The potential flaw of this common practice is that the obtained relaxed continuous spectral solution could severely deviate from the true discrete solution. In this paper, we propose to impose an additional orthonormal constraint to better approximate the optimal continuous solution to the graph cut objective functions. Such a method, called spectral rotation in the literature, optimizes the spectral clustering objective functions better than K-Means and improves the clustering accuracy. We provide an efficient algorithm to solve the new problem rigorously, which is not significantly more costly than K-Means. We also establish the connection between our method and K-Means to provide theoretical motivation for our method. Experimental results show that our algorithm consistently reaches a better cut and outperforms classic spectral clustering methods on clustering metrics.


Convex Subspace Representation Learning from Multi-View Data

AAAI Conferences

Learning from multi-view data is important in many applications. In this paper, we propose a novel convex subspace representation learning method for unsupervised multi-view clustering. We first formulate the subspace learning with multiple views as a joint optimization problem with a common subspace representation matrix and a group sparsity inducing norm. By exploiting the properties of dual norms, we then show that a convex min-max dual formulation with a sparsity inducing trace norm can be obtained. We develop a proximal bundle optimization algorithm to globally solve the min-max optimization problem. Our empirical study shows that the proposed subspace representation learning method can effectively facilitate multi-view clustering and produce clustering results superior to those of alternative multi-view clustering methods.


Formalizing Hierarchical Clustering as Integer Linear Programming

AAAI Conferences

Hierarchical clustering is typically implemented as a greedy heuristic algorithm with no explicit objective function. In this work we formalize hierarchical clustering as an integer linear programming (ILP) problem with a natural objective function and the dendrogram properties enforced as linear constraints. Though exact solvers exist for ILP, we show that a simple randomized algorithm and a linear programming (LP) relaxation can be used to provide approximate solutions faster. Formalizing hierarchical clustering also has the benefit that relaxing the constraints can produce novel problem variations such as overlapping clusterings. Our experiments show that our formulation is capable of outperforming standard agglomerative clustering algorithms in a variety of settings, including traditional hierarchical clustering as well as learning overlapping clusterings.