Goto

Collaborating Authors

 Statistical Learning


Optimization with Sparsity-Inducing Penalties

arXiv.org Machine Learning

Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. They were first dedicated to linear variable selection but numerous extensions have now emerged such as structured sparsity or kernel selection. It turns out that many of the related estimation problems can be cast as convex optimization problems by regularizing the empirical risk with appropriate non-smooth norms. The goal of this paper is to present from a general perspective optimization tools and techniques dedicated to such sparsity-inducing penalties. We cover proximal methods, block-coordinate descent, reweighted $\ell_2$-penalized techniques, working-set and homotopy methods, as well as non-convex formulations and extensions, and provide an extensive set of experiments to compare various algorithms from a computational point of view.


Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization

arXiv.org Machine Learning

Relative to the large literature on upper bounds on complexity of convex optimization, lesser attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining an understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes.


Joint Modeling of Multiple Related Time Series via the Beta Process

arXiv.org Machine Learning

We propose a Bayesian nonparametric approach to the problem of jointly modeling multiple related time series. Our approach is based on the discovery of a set of latent, shared dynamical behaviors. Using a beta process prior, the size of the set and the sharing pattern are both inferred from data. We develop efficient Markov chain Monte Carlo methods based on the Indian buffet process representation of the predictive distribution of the beta process, without relying on a truncated model. In particular, our approach uses the sum-product algorithm to efficiently compute Metropolis-Hastings acceptance probabilities, and explores new dynamical behaviors via birth and death proposals. We examine the benefits of our proposed feature-based model on several synthetic datasets, and also demonstrate promising results on unsupervised segmentation of visual motion capture data.


Fast Learning Rate of Non-Sparse Multiple Kernel Learning and Optimal Regularization Strategies

arXiv.org Machine Learning

In this paper, we give a new generalization error bound of Multiple Kernel Learning (MKL) for a general class of regularizations, and discuss what kind of regularization gives a favorable predictive accuracy. Our main target in this paper is dense type regularizations including \ellp-MKL. According to the recent numerical experiments, the sparse regularization does not necessarily show a good performance compared with dense type regularizations. Motivated by this fact, this paper gives a general theoretical tool to derive fast learning rates of MKL that is applicable to arbitrary mixed-norm-type regularizations in a unifying manner. This enables us to compare the generalization performances of various types of regularizations. As a consequence, we observe that the homogeneity of the complexities of candidate reproducing kernel Hilbert spaces (RKHSs) affects which regularization strategy (\ell1 or dense) is preferred. In fact, in homogeneous complexity settings where the complexities of all RKHSs are evenly same, \ell1-regularization is optimal among all isotropic norms. On the other hand, in inhomogeneous complexity settings, dense type regularizations can show better learning rate than sparse \ell1-regularization. We also show that our learning rate achieves the minimax lower bound in homogeneous complexity settings.


Estimated VC dimension for risk bounds

arXiv.org Machine Learning

Vapnik-Chervonenkis (VC) dimension is a fundamental measure of the generalization capacity of learning algorithms. However, apart from a few special cases, it is hard or impossible to calculate analytically. Vapnik et al. [10] proposed a technique for estimating the VC dimension empirically. While their approach behaves well in simulations, it could not be used to bound the generalization risk of classifiers, because there were no bounds for the estimation error of the VC dimension itself. We rectify this omission, providing high probability concentration results for the proposed estimator and deriving corresponding generalization bounds.


UPAL: Unbiased Pool Based Active Learning

arXiv.org Artificial Intelligence

In this paper we address the problem of pool based active learning, and provide an algorithm, called UPAL, that works by minimizing the unbiased estimator of the risk of a hypothesis in a given hypothesis space. For the space of linear classifiers and the squared loss we show that UPAL is equivalent to an exponentially weighted average forecaster. Exploiting some recent results regarding the spectra of random matrices allows us to establish consistency of UPAL when the true hypothesis is a linear hypothesis. Empirical comparison with an active learner implementation in Vowpal Wabbit, and a previously proposed pool based active learner implementation show good empirical performance and better scalability.


An Iterative Algorithm for Fitting Nonconvex Penalized Generalized Linear Models with Grouped Predictors

arXiv.org Machine Learning

High-dimensional data pose challenges in statistical learning and modeling. Sometimes the predictors can be naturally grouped where pursuing the between-group sparsity is desired. Collinearity may occur in real-world high-dimensional applications where the popular $l_1$ technique suffers from both selection inconsistency and prediction inaccuracy. Moreover, the problems of interest often go beyond Gaussian models. To meet these challenges, nonconvex penalized generalized linear models with grouped predictors are investigated and a simple-to-implement algorithm is proposed for computation. A rigorous theoretical result guarantees its convergence and provides tight preliminary scaling. This framework allows for grouped predictors and nonconvex penalties, including the discrete $l_0$ and the `$l_0+l_2$' type penalties. Penalty design and parameter tuning for nonconvex penalties are examined. Applications of super-resolution spectrum estimation in signal processing and cancer classification with joint gene selection in bioinformatics show the performance improvement by nonconvex penalized estimation.


Statistical Topic Models for Multi-Label Document Classification

arXiv.org Machine Learning

Machine learning approaches to multi-label document classification have to date largely relied on discriminative modeling techniques such as support vector machines. A drawback of these approaches is that performance rapidly drops off as the total number of labels and the number of labels per document increase. This problem is amplified when the label frequencies exhibit the type of highly skewed distributions that are often observed in real-world datasets. In this paper we investigate a class of generative statistical topic models for multi-label documents that associate individual word tokens with different labels. We investigate the advantages of this approach relative to discriminative models, particularly with respect to classification problems involving large numbers of relatively rare labels. We compare the performance of generative and discriminative approaches on document labeling tasks ranging from datasets with several thousand labels to datasets with tens of labels. The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies.


The theory and application of penalized methods or Reproducing Kernel Hilbert Spaces made easy

arXiv.org Machine Learning

The popular cubic smoothing spline estimate of a regression function arises as the minimizer of the penalized sum of squares $\sum_j(Y_j - {\mu}(t_j))^2 + {\lambda}\int_a^b [{\mu}"(t)]^2 dt$, where the data are $t_j,Y_j$, $j=1,..., n$. The minimization is taken over an infinite-dimensional function space, the space of all functions with square integrable second derivatives. But the calculations can be carried out in a finite-dimensional space. The reduction from minimizing over an infinite dimensional space to minimizing over a finite dimensional space occurs for more general objective functions: the data may be related to the function ${\mu}$ in another way, the sum of squares may be replaced by a more suitable expression, or the penalty, $\int_a^b [{\mu}"(t)]^2 dt$, might take a different form. This paper reviews the Reproducing Kernel Hilbert Space structure that provides a finite-dimensional solution for a general minimization problem. Particular attention is paid to penalties based on linear differential operators. In this case, one can sometimes easily calculate the minimizer explicitly, using Green's functions.


On the stability of bootstrap estimators

arXiv.org Machine Learning

It is shown that bootstrap approximations of an estimator which is based on a continuous operator from the set of Borel probability measures defined on a compact metric space into a complete separable metric space is stable in the sense of qualitative robustness. Support vector machines based on shifted loss functions are treated as special cases.