Not enough data to create a plot.
Try a different view from the menu above.
Bach, Francis R.
Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Schmidt, Mark, Roux, Nicolas L., Bach, Francis R.
We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the second term. We show that the basic proximal-gradient method, the basic proximal-gradient method with a strong convexity assumption, and the accelerated proximal-gradient method achieve the same convergence rates as in the error-free case, provided the errors decrease at an appropriate rate. Our experimental results on a structured sparsity problem indicate that sequences of errors with these appealing theoretical properties can lead to practical performance improvements.
Trace Lasso: a trace norm regularization for correlated designs
Grave, Edouard, Obozinski, Guillaume R., Bach, Francis R.
Using the $\ell_1$-norm to regularize the estimation of the parameter vector of a linear model leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce a new penalty function which takes into account the correlation of the design matrix to stabilize the estimation. This norm, called the trace Lasso, uses the trace norm of the selected covariates, which is a convex surrogate of their rank, as the criterion of model complexity. We analyze the properties of our norm, describe an optimization algorithm based on reweighted least-squares, and illustrate the behavior of this norm on synthetic data, showing that it is more adapted to strong correlations than competing methods such as the elastic net.
Network Flow Algorithms for Structured Sparsity
Mairal, Julien, Jenatton, Rodolphe, Bach, Francis R., Obozinski, Guillaume R.
We consider a class of learning problems that involve a structured sparsity-inducing norm defined as the sum of $\ell_\infty$-norms over groups of variables. Whereas a lot of effort has been put in developing fast optimization methods when the groups are disjoint or embedded in a specific hierarchical structure, we address here the case of general overlapping groups. To this end, we show that the corresponding optimization problem is related to network flow optimization. More precisely, the proximal problem associated with the norm we consider is dual to a quadratic min-cost flow problem. We propose an efficient procedure which computes its solution exactly in polynomial time. Our algorithm scales up to millions of groups and variables, and opens up a whole new range of applications for structured sparse models. We present several experiments on image and video data, demonstrating the applicability and scalability of our approach for various problems.
Online Learning for Latent Dirichlet Allocation
Hoffman, Matthew, Bach, Francis R., Blei, David M.
We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good or better than those found with batch VB, and in a fraction of the time.
Efficient Optimization for Discriminative Latent Class Models
Joulin, Armand, Ponce, Jean, Bach, Francis R.
Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression; we show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While expectation-maximization (EM) algorithm is commonly used to learn these models, its optimization usually leads to local minimum because it relies on a non-convex cost function with many such local minima. To avoid this problem, we introduce a local approximation of this cost function, which leads to a quadratic non-convex optimization problem over a product of simplices. In order to minimize such functions, we propose an efficient algorithm based on convex relaxation and low-rank representation of our data, which allows to deal with large instances. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods.
Structured sparsity-inducing norms through submodular functions
Bach, Francis R.
Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the L1-norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovasz extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.
Asymptotically Optimal Regularization in Smooth Parametric Models
Liang, Percy S., Bouchard, Guillaume, Bach, Francis R., Jordan, Michael I.
Many types of regularization schemes have been employed in statistical learning, each one motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning.
Kernel Change-point Analysis
Harchaoui, Zaïd, Moulines, Eric, Bach, Francis R.
We introduce a kernel-based method for change-point analysis within a sequence of temporal observations. Change-point analysis of an (unlabelled) sample of observations consists in, first, testing whether a change in the distribution occurs within the sample, and second, if a change occurs, estimating the change-point instant after which the distribution of the observations switches from one distribution to another different distribution. We propose a test statistics based upon the maximum kernel Fisher discriminant ratio as a measure of homogeneity between segments. We derive its limiting distribution under the null hypothesis (no change occurs), and establish the consistency under the alternative hypothesis (a change occurs). This allows to build a statistical hypothesis testing procedure for testing the presence of change-point, with a prescribed false-alarm probability and detection probability tending to one in the large-sample setting. If a change actually occurs, the test statistics also yields an estimator of the change-point location. Promising experimental results in temporal segmentation of mental tasks from BCI data and pop song indexation are presented.
Supervised Dictionary Learning
Mairal, Julien, Ponce, Jean, Sapiro, Guillermo, Zisserman, Andrew, Bach, Francis R.
It is now well established that sparse signal models are well suited to restoration tasks and can effectively be learned from audio, image, and video data. Recent research has been aimed at learning discriminative sparse models instead of purely reconstructive ones. This paper proposes a new step in that direction with a novel sparse representation for signals belonging to different classes in terms of a shared dictionary and multiple decision functions. It is shown that the linear variant of the model admits a simple probabilistic interpretation, and that its most general variant also admits a simple interpretation in terms of kernels. An optimization framework for learning all the components of the proposed model is presented, along with experiments on standard handwritten digit and texture classification tasks.
Sparse probabilistic projections
Archambeau, Cédric, Bach, Francis R.
We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptoanalysis as a pre-processing tool for the construction of template attacks.