Goto

Collaborating Authors

 Clustering


Feature Selection For High-Dimensional Clustering

arXiv.org Machine Learning

There are many methods for feature selection in high-dimensional classification and regression. These methods require assumptions such as sparsity and incoherence. Some methods (Fan and Lv 2008) also assume that relevant variables are detectable through marginal correlations. Given these assumptions, one can prove guarantees for the performance of the method. A similar theory for feature selection in clustering is lacking. There exist a number of methods but they do not come with precise assumptions and guarantees. In this paper we propose a method involving two steps: 1. A screening step to eliminate uninformative features.


Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures

arXiv.org Machine Learning

We consider the problem of clustering data points in high dimensions, i.e. when the number of data points may be much smaller than the number of dimensions. Specifically, we consider a Gaussian mixture model (GMM) with non-spherical Gaussian components, where the clusters are distinguished by only a few relevant dimensions. The method we propose is a combination of a recent approach for learning parameters of a Gaussian mixture model and sparse linear discriminant analysis (LDA). In addition to cluster assignments, the method returns an estimate of the set of features relevant for clustering. Our results indicate that the sample complexity of clustering depends on the sparsity of the relevant feature set, while only scaling logarithmically with the ambient dimension. Additionally, we require much milder assumptions than existing work on clustering in high dimensions. In particular, we do not require spherical clusters nor necessitate mean separation along relevant dimensions.


Compressed Gaussian Process

arXiv.org Machine Learning

Nonparametric regression for massive numbers of samples (n) and features (p) is an increasingly important problem. In big n settings, a common strategy is to partition the feature space, and then separately apply simple models to each partition set. We propose an alternative approach, which avoids such partitioning and the associated sensitivity to neighborhood choice and distance metrics, by using random compression combined with Gaussian process regression. The proposed approach is particularly motivated by the setting in which the response is conditionally independent of the features given the projection to a low dimensional manifold. Conditionally on the random compression matrix and a smoothness parameter, the posterior distribution for the regression surface and posterior predictive distributions are available analytically. Running the analysis in parallel for many random compression matrices and smoothness parameters, model averaging is used to combine the results. The algorithm can be implemented rapidly even in very big n and p problems, has strong theoretical justification, and is found to yield state of the art predictive performance.


Online Clustering of Bandits

arXiv.org Machine Learning

We introduce a novel algorithmic approach to content recommendation based on adaptive clustering of exploration-exploitation ("bandit") strategies. We provide a sharp regret analysis of this algorithm in a standard stochastic noise setting, demonstrate its scalability properties, and prove its effectiveness on a number of artificial and real-world datasets. Our experiments show a significant increase in prediction performance over state-of-the-art methods for bandit problems.


Consistent procedures for cluster tree estimation and pruning

arXiv.org Machine Learning

For a density $f$ on ${\mathbb R}^d$, a {\it high-density cluster} is any connected component of $\{x: f(x) \geq \lambda\}$, for some $\lambda > 0$. The set of all high-density clusters forms a hierarchy called the {\it cluster tree} of $f$. We present two procedures for estimating the cluster tree given samples from $f$. The first is a robust variant of the single linkage algorithm for hierarchical clustering. The second is based on the $k$-nearest neighbor graph of the samples. We give finite-sample convergence rates for these algorithms which also imply consistency, and we derive lower bounds on the sample complexity of cluster tree estimation. Finally, we study a tree pruning procedure that guarantees, under milder conditions than usual, to remove clusters that are spurious while recovering those that are salient.


Optimal Data Collection For Informative Rankings Expose Well-Connected Graphs

arXiv.org Machine Learning

Given a graph where vertices represent alternatives and arcs represent pairwise comparison data, the statistical ranking problem is to find a potential function, defined on the vertices, such that the gradient of the potential function agrees with the pairwise comparisons. Our goal in this paper is to develop a method for collecting data for which the least squares estimator for the ranking problem has maximal Fisher information. Our approach, based on experimental design, is to view data collection as a bi-level optimization problem where the inner problem is the ranking problem and the outer problem is to identify data which maximizes the informativeness of the ranking. Under certain assumptions, the data collection problem decouples, reducing to a problem of finding multigraphs with large algebraic connectivity. This reduction of the data collection problem to graph-theoretic questions is one of the primary contributions of this work. As an application, we study the Yahoo! Movie user rating dataset and demonstrate that the addition of a small number of well-chosen pairwise comparisons can significantly increase the Fisher informativeness of the ranking. As another application, we study the 2011-12 NCAA football schedule and propose schedules with the same number of games which are significantly more informative. Using spectral clustering methods to identify highly-connected communities within the division, we argue that the NCAA could improve its notoriously poor rankings by simply scheduling more out-of-conference games.


Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain

arXiv.org Machine Learning

Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier, Wavelets, etc.). However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the \emph{tightest} lower/upper bound on Euclidean distances when each data object is potentially compressed using a different set of orthonormal coefficients. Our technique leads to tighter distance estimates, which translates into more accurate search, learning and mining operations \textit{directly} in the compressed domain. We formulate the problem of estimating lower/upper distance bounds as an optimization problem. We establish the properties of optimal solutions, and leverage the theoretical analysis to develop a fast algorithm to obtain an \emph{exact} solution to the problem. The suggested solution provides the tightest estimation of the $L_2$-norm or the correlation. We show that typical data-analysis operations, such as k-NN search or k-Means clustering, can operate more accurately using the proposed compression and distance reconstruction technique. We compare it with many other prevalent compression and reconstruction techniques, including random projections and PCA-based techniques. We highlight a surprising result, namely that when the data are highly sparse in some basis, our technique may even outperform PCA-based compression. The contributions of this work are generic as our methodology is applicable to any sequential or high-dimensional data as well as to any orthogonal data transformation used for the underlying data compression scheme.


Minimal Dirichlet energy partitions for graphs

arXiv.org Machine Learning

Motivated by a geometric problem, we introduce a new non-convex graph partitioning objective where the optimality criterion is given by the sum of the Dirichlet eigenvalues of the partition components. A relaxed formulation is identified and a novel rearrangement algorithm is proposed, which we show is strictly decreasing and converges in a finite number of iterations to a local minimum of the relaxed objective function. Our method is applied to several clustering problems on graphs constructed from synthetic data, MNIST handwritten digits, and manifold discretizations. The model has a semi-supervised extension and provides a natural representative for the clusters as well.


Modelling Data Dispersion Degree in Automatic Robust Estimation for Multivariate Gaussian Mixture Models with an Application to Noisy Speech Processing

arXiv.org Machine Learning

The trimming scheme with a prefixed cutoff portion is known as a method of improving the robustness of statistical models such as multivariate Gaussian mixture models (MG- MMs) in small scale tests by alleviating the impacts of outliers. However, when this method is applied to real- world data, such as noisy speech processing, it is hard to know the optimal cut-off portion to remove the outliers and sometimes removes useful data samples as well. In this paper, we propose a new method based on measuring the dispersion degree (DD) of the training data to avoid this problem, so as to realise automatic robust estimation for MGMMs. The DD model is studied by using two different measures. For each one, we theoretically prove that the DD of the data samples in a context of MGMMs approximately obeys a specific (chi or chi-square) distribution. The proposed method is evaluated on a real-world application with a moderately-sized speaker recognition task. Experiments show that the proposed method can significantly improve the robustness of the conventional training method of GMMs for speaker recognition.


Strategy Mining

AAAI Conferences

Strategy mining is a new area of research about discovering strategies for decision-making. It is motivated by how similarity is assessed in retrospect in law. In the legal domain, when both case facts and court decisions are present, it is often useful to assess similarity by accounting for both case facts and case outcomes. In this paper, we formulate the strategy mining problem as a clustering problem with the goal of finding clusters that represent disparate conditional dependency of decision labels on other features. Existing clustering algorithms are inappropriate to cluster dependency because they either assume feature independence, such as K-means, or only consider the co-occurrence of features without explicitly modeling the special dependency of the decision label on other features, such as Latent Dirichlet Allocation (LDA). We propose an Expectation Maximization (EM) style unsupervised learning algorithm for dependency clustering. Like EM, our algorithm is grounded in statistical learning theory. It minimizes the empirical risk of decision tree learning. Unlike other clustering algorithms, our algorithm is irrelevant-feature resistant, and its learned clusters modeled by decision trees are strongly interpretable and predictive. We systematically evaluate both the convergence property and solution quality of our algorithm using a common law dataset comprised of actual cases. Experimental results show that our algorithm significantly outperforms K-means and LDA on clustering dependency