Goto

Collaborating Authors

 Statistical Learning


Measures of Clustering Quality: A Working Set of Axioms for Clustering

Neural Information Processing Systems

Aiming towards the development of a general clustering theory, we discuss abstract axiomatization for clustering. In this respect, we follow up on the work of Kelinberg, (Kleinberg) that showed an impossibility result for such axiomatization. We argue that an impossibility result is not an inherent feature of clustering, but rather, to a large extent, it is an artifact of the specific formalism used in Kleinberg. As opposed to previous work focusing on clustering functions, we propose to address clustering quality measures as the primitive object to be axiomatized. We show that principles like those formulated in Kleinberg's axioms can be readily expressed in the latter framework without leading to inconsistency. A clustering-quality measure is a function that, given a data set and its partition into clusters, returns a non-negative real number representing how `strong' or `conclusive' the clustering is. We analyze what clustering-quality measures should look like and introduce a set of requirements (`axioms') that express these requirement and extend the translation of Kleinberg's axioms to our framework. We propose several natural clustering quality measures, all satisfying the proposed axioms. In addition, we show that the proposed clustering quality can be computed in polynomial time.


Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

Neural Information Processing Systems

Given $n$ noisy samples with $p$ dimensions, where $n \ll p$, we show that the multi-stage thresholding procedures can accurately estimate a sparse vector $\beta \in \R^p$ in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of $s$, which is the number of non-zero elements in the true parameter $\beta$. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if $X$ obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Cand\{e}s-Tao 07) achieves the $\ell_2$ loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model.


Adaptive Regularization for Transductive Support Vector Machine

Neural Information Processing Systems

We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependant partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependant and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms.


Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Neural Information Processing Systems

We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as L1-norm for sparsity. We develop a new online algorithm, the regularized dual averaging method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. This method achieves the optimal convergence rate and often enjoys a low complexity per iteration similar as the standard stochastic gradient method. Computational experiments are presented for the special case of sparse online learning using L1-regularization.


Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

Neural Information Processing Systems

Principal component analysis is a fundamental operation in computational data analysis, with myriad applications ranging from web search to bioinformatics to computer vision and image analysis. However, its performance and applicability in real scenarios are limited by a lack of robustness to outlying or corrupted observations. Thispaper considers the idealized "robust principal component analysis" problem of recovering a low rank matrix A from corrupted observations D A E. Here, the corrupted entries E are unknown and the errors can be arbitrarily large (modeling grossly corrupted observations common in visual and bioinformatic data), but are assumed to be sparse. We prove that most matrices A can be efficiently and exactly recovered from most error sign-and-support patterns bysolving a simple convex program, for which we give a fast and provably convergent algorithm. Our result holds even when the rank of A grows nearly proportionally (up to a logarithmic factor) to the dimensionality of the observation spaceand the number of errors E grows in proportion to the total number of entries in the matrix. A byproduct of our analysis is the first proportional growth results for the related problem of completing a low-rank matrix from a small fraction ofits entries. Simulations and real-data examples corroborate the theoretical results, and suggest potential applications in computer vision.


Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

Neural Information Processing Systems

We prove certain theoretical properties of a graph-regularized transductive learning objective that is based on minimizing a Kullback-Leibler divergence based loss. These include showing that the iterative alternating minimization procedure used to minimize the objective converges to the correct solution and deriving a test for convergence. We also propose a graph node ordering algorithm that is cache cognizant and leads to a linear speedup in parallel computations. This ensures that the algorithm scales to large data sets. By making use of empirical evaluation on the TIMIT and Switchboard I corpora, we show this approach is able to out-perform other state-of-the-art SSL approaches. In one instance, we solve a problem on a 120 million node graph.



Learning in Markov Random Fields using Tempered Transitions

Neural Information Processing Systems

Markov random fields (MRFs), or undirected graphical models, provide a powerful framework for modeling complex dependencies among random variables. Maximum likelihood learning in MRFs is hard due to the presence of the global normalizing constant. In this paper we consider a class of stochastic approximation algorithms of Robbins-Monro type that uses Markov chain Monte Carlo to do approximate maximum likelihood learning. We show that using MCMC operators based on tempered transitions enables the stochastic approximation algorithm to better explore highly multimodal distributions, which considerably improves parameter estimates in large densely-connected MRFs. Our results on MNIST and NORB datasets demonstrate that we can successfully learn good generative models of high-dimensional, richly structured data and perform well on digit and object recognition tasks.


Linear-time Algorithms for Pairwise Statistical Problems

Neural Information Processing Systems

Several key computational bottlenecks in machine learning involve pairwise distance computations, including all-nearest-neighbors (finding the nearest neighbor(s) for each point, e.g. in manifold learning) and kernel summations (e.g. in kernel density estimation or kernel machines). We consider the general, bichromatic case for these problems, in addition to the scientific problem of N-body potential calculation. In this paper we show for the first time O(N) worst case runtimes for practical algorithms for these problems based on the cover tree data structure (Beygelzimer, Kakade, Langford, 2006).


Multi-Label Prediction via Sparse Infinite CCA

Neural Information Processing Systems

Canonical Correlation Analysis (CCA) is a useful technique for modeling dependencies between two (or more) sets of variables. Building upon the recently suggested probabilistic interpretation of CCA, we propose a nonparametric, fully Bayesian framework that can automatically select the number of correlation components, and effectively capture the sparsity underlying the projections. In addition, given (partially) labeled data, our algorithm can also be used as a (semi)supervised dimensionality reduction technique, and can be applied to learn useful predictive features in the context of learning a set of related tasks. Experimental results demonstrate the efficacy of the proposed approach for both CCA as a stand-alone problem, and when applied to multi-label prediction.