Clustering
An Impossibility Theorem for Clustering
Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) tradeoffs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median.
Bayesian Models of Inductive Generalization
Sanjana, Neville E., Tenenbaum, Joshua B.
We argue that human inductive generalization is best explained in a Bayesian framework, rather than by traditional models based on similarity computations. We go beyond previous work on Bayesian concept learning by introducing an unsupervised method for constructing flexible hypothesis spaces, and we propose a version of the Bayesian Occam's razor that trades off priors and likelihoods to prevent under-or over-generalization in these flexible spaces. We analyze two published data sets on inductive reasoning as well as the results of a new behavioral study that we have carried out.
An Impossibility Theorem for Clustering
Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverseapproaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: fora set of three simple properties, we show that there is no clustering function satisfying all three.
Half-Lives of EigenFlows for Spectral Clustering
Chennubhotla, Chakra, Jepson, Allan D.
Using a Markov chain perspective of spectral clustering we present an algorithm to automatically find the number of stable clusters in a dataset. The Markov chain's behaviour is characterized by the spectral properties of the matrix of transition probabilities, from which we derive eigenflows along with their halflives. An eigenflow describes the flow of probability massdue to the Markov chain, and it is characterized by its eigenvalue, orequivalently, by the halflife of its decay as the Markov chain is iterated. A ideal stable cluster is one with zero eigenflow and infinite half-life.The key insight in this paper is that bottlenecks between weakly coupled clusters can be identified by computing the sensitivity of the eigenflow's halflife to variations in the edge weights. We propose a novel EIGENCUTS algorithm to perform clustering that removes these identified bottlenecks in an iterative fashion.
Feature Selection in Mixture-Based Clustering
Law, Martin H., Jain, Anil K., Figueiredo, Mรกrio
There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami's mutual-informationbased featurerelevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a "wrapper", since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.
Stability-Based Model Selection
Lange, Tilman, Braun, Mikio L., Roth, Volker, Buhmann, Joachim M.
Model selection is linked to model assessment, which is the problem of comparing different models, or model parameters, for a specific learning task. For supervised learning, the standard practical technique is crossvalidation, whichis not applicable for semi-supervised and unsupervised settings. In this paper, a new model assessment scheme is introduced which is based on a notion of stability. The stability measure yields an upper bound to cross-validation in the supervised case, but extends to semi-supervised and unsupervised problems. In the experimental part, the performance of the stability measure is studied for model order selection incomparison to standard techniques in this area.
Distance Metric Learning with Application to Clustering with Side-Information
Xing, Eric P., Jordan, Michael I., Russell, Stuart J., Ng, Andrew Y.
Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many "plausible" ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider "similar."For
Bayesian Models of Inductive Generalization
Sanjana, Neville E., Tenenbaum, Joshua B.
We argue that human inductive generalization is best explained in a Bayesian framework, rather than by traditional models based on similarity computations.We go beyond previous work on Bayesian concept learning by introducing an unsupervised method for constructing flexible hypothesisspaces, and we propose a version of the Bayesian Occam's razorthat trades off priors and likelihoods to prevent under-or over-generalization in these flexible spaces. We analyze two published data sets on inductive reasoning as well as the results of a new behavioral study that we have carried out.