hartigan
An Axiomatic Definition of Hierarchical Clustering
Arias-Castro, Ery, Coda, Elizabeth
Clustering, informally understood as the grouping of data, is a central task in statistics and computer science with broad applications. Modern clustering algorithms originated in the work of numerical taxonomists, who developed methods to identify hierarchical structures in the classification of plant and animal species. Since then clustering has been used in disciplines such as medicine, astronomy, anthropology, economics, etc., with aims such as exploratory analysis, data summarization, the identification of salient structures in data, and information organization. The notion of a "good" or "accurate" clustering varies between fields and applications. For example, to some computer scientists, the correct clustering of a dataset is often defined as the solution to an optimization problem (think K-means) and a good algorithm either solves or approximates a solution to this problem, ideally with some guarantees [14, 31].
Level Sets or Gradient Lines? A Unifying View of Modal Clustering
Arias-Castro, Ery, Qiao, Wanli
Up until the 1970's there were two main ways of clustering points in space. One of them, perhaps pioneered by Pearson [44], was to fit a (usually Gaussian) mixture to the data, and that being done, classify each data point -- as well as any other point available at a later date -- according to the most likely component in the mixture. The other one was based on a direct partitioning of the space, most notably by minimization of the average minimum squared distance to a center: the K-means problem, whose computational difficulty led to a number of famous algorithms [22, 31, 36, 37, 39] and likely played a role in motivating the development of hierarchical clustering [21, 25, 54, 63]. In the 1970's, two decidedly nonparametric approaches to clustering were proposed, both based on the topography given by the population density. Of course, in practice, the density is estimated, often by some form of kernel density estimation.
CAC: A Clustering Based Framework for Classification
Srivastava, Shivin, Bhatia, Siddharth, Huang, Lingxiao, Heng, Lim Jun, Kawaguchi, Kenji, Rajan, Vaibhav
In data containing heterogeneous subpopulations, classification performance benefits from incorporating the knowledge of cluster structure in the classifier. Previous methods for such combined clustering and classification either are classifier-specific and not generic or independently perform clustering and classifier training, which may not form clusters that can potentially benefit classifier performance. The question of how to perform clustering to improve the performance of classifiers trained on the clusters has received scant attention in previous literature despite its importance in several real-world applications. In this paper, we theoretically analyze when and how clustering may help in obtaining accurate classifiers. We design a simple, efficient, and generic framework called Classification Aware Clustering (CAC), to find clusters that are well suited for being used as training datasets by classifiers for each underlying subpopulation. Our experiments on synthetic and real benchmark datasets demonstrate the efficacy of CAC over previous methods for combined clustering and classification.
k-sums: another side of k-means
Zhao, Wan-Lei, Chen, Run-Qing, Ye, Hui, Ngo, Chong-Wah
In this paper, the decades-old clustering method k-means is revisited. The original distortion minimization model of k-means is addressed by a pure stochastic minimization procedure. In each step of the iteration, one sample is tentatively reallocated from one cluster to another. It is moved to another cluster as long as the reallocation allows the sample to be closer to the new centroid. This optimization procedure converges faster to a better local minimum over k-means and many of its variants. This fundamental modification over the k-means loop leads to the redefinition of a family of k-means variants. Moreover, a new target function that minimizes the summation of pairwise distances within clusters is presented. We show that it could be solved under the same stochastic optimization procedure. This minimization procedure built upon two minimization models outperforms k-means and its variants considerably with different settings and on different datasets.
Energy Clustering
França, Guilherme, Vogelstein, Joshua T.
Energy statistics was proposed by Sz\'{e}kely in the 80's inspired by the Newtonian gravitational potential from classical mechanics, and it provides a hypothesis test for equality of distributions. It was further generalized from Euclidean spaces to metric spaces of strong negative type, and more recently, a connection with reproducing kernel Hilbert spaces (RKHS) was established. Here we consider the clustering problem from an energy statistics theory perspective, providing a precise mathematical formulation yielding a quadratically constrained quadratic program (QCQP) in the associated RKHS, thus establishing the connection with kernel methods. We show that this QCQP is equivalent to kernel $k$-means optimization problem once the kernel is fixed. These results imply a first principles derivation of kernel $k$-means from energy statistics. However, energy statistics fixes a family of standard kernels. Furthermore, we also consider a weighted version of energy statistics, making connection to graph partitioning problems. To find local optimizers of such QCQP we propose an iterative algorithm based on Hartigan's method, which in this case has the same computational cost as kernel $k$-means algorithm, based on Lloyd's heuristic, but usually with better clustering quality. We provide carefully designed numerical experiments showing the superiority of the proposed method compared to kernel $k$-means, spectral clustering, standard $k$-means, and Gaussian mixture models in a variety of settings.
Further heuristics for $k$-means: The merge-and-split heuristic and the $(k,l)$-means
Finding the optimal $k$-means clustering is NP-hard in general and many heuristics have been designed for minimizing monotonically the $k$-means objective. We first show how to extend Lloyd's batched relocation heuristic and Hartigan's single-point relocation heuristic to take into account empty-cluster and single-point cluster events, respectively. Those events tend to increasingly occur when $k$ or $d$ increases, or when performing several restarts. First, we show that those special events are a blessing because they allow to partially re-seed some cluster centers while further minimizing the $k$-means objective function. Second, we describe a novel heuristic, merge-and-split $k$-means, that consists in merging two clusters and splitting this merged cluster again with two new centers provided it improves the $k$-means objective. This novel heuristic can improve Hartigan's $k$-means when it has converged to a local minimum. We show empirically that this merge-and-split $k$-means improves over the Hartigan's heuristic which is the {\em de facto} method of choice. Finally, we propose the $(k,l)$-means objective that generalizes the $k$-means objective by associating the data points to their $l$ closest cluster centers, and show how to either directly convert or iteratively relax the $(k,l)$-means into a $k$-means in order to reach better local minima.
Hartigan's K-Means Versus Lloyd's K-Means — Is It Time for a Change?
Slonim, Noam (IBM Haifa Research Lab) | Aharoni, Ehud (IBM Haifa Research Lab) | Crammer, Koby (The Technion Department of Electrical Engineering)
Hartigan's method for k-means clustering holds several potential advantages compared to the classical and prevalent optimization heuristic known as Lloyd's algorithm. E.g., it was recently shown that the set of local minima of Hartigan's algorithm is a subset of those of Lloyd's method. We develop a closed-form expression that allows to establish Hartigan's method for k-means clustering with any Bregman divergence, and further strengthen the case of preferring Hartigan's algorithm over Lloyd's algorithm. Specifically, we characterize a range of problems with various noise levels of the inputs, for which any random partition represents a local minimum for Lloyd's algorithm, while Hartigan's algorithm easily converges to the correct solution. Extensive experiments on synthetic and real-world data further support our theoretical analysis.