Gradient Descent
8f121ce07d74717e0b1f21d122e04521-Reviews.html
The paper presents a iterative algorithm to a robust principal component matrix factorization. The data is modeled as a sum of a low rank matrix approximation and a sparse noise matrix. Constraining the norms of the row and column factors of the former part of the sum allows to implement the nuclear norm minimization of the batch robust pca algorithm in an online stochastic gradient descent fashion. The approach taken by the authors resembles very much the approach taken by Marial et al 2009 (ICML) and 2010 (JMLR), only that the objective is slightly different (Robust PCA was not dealt with in the JMLR version). One difference between the JMLR and the ICML version is that a standard stochastic gradient version of the objective function did not perform as well as the proposed online dictionary learning approach (in which the statistics of the data are accumulated in the matrices A and B) in the ICML version, but in the JMLR version, a standard stochastic gradient implementation with appropriately chosen learning rate seemed to perform ok.
Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions
In this work we develop efficient methods for learning random MAP predictors for structured label problems. In particular, we construct posterior distributions over perturbations that can be adjusted via stochastic gradient methods. We show that any smooth posterior distribution would suffice to define a smooth PAC-Bayesian risk bound suitable for gradient methods. In addition, we relate the posterior distributions to computational properties of the MAP predictors. We suggest multiplicative posteriors to learn super-modular potential functions that accompany specialized MAP predictors such as graph-cuts. We also describe label-augmented posterior models that can use efficient MAP approximations, such as those arising from linear program relaxations.
Non-strongly-convex smooth stochastic approximation with convergence rateO(1/n)
We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which includes machine learning methods based on the minimization of the empirical risk. We focus on problems without strong convexity, for which all previously known algorithms achieve a convergence rate for function values of O(1/ n) after n iterations. We consider and analyze two algorithms that achieve a rate of O(1/n) for classical supervised learning problems. For least-squares regression, we show that averaged stochastic gradient descent with constant step-size achieves the desired rate. For logistic regression, this is achieved by a simple novel stochastic gradient algorithm that (a) constructs successive local quadratic approximations of the loss functions, while (b) preserving the same running-time complexity as stochastic gradient descent. For these algorithms, we provide a non-asymptotic analysis of the generalization error (in expectation, and also in high probability for least-squares), and run extensive experiments showing that they often outperform existing approaches.
7380ad8a673226ae47fce7bff88e9c33-Reviews.html
Summary of paper: This is a technical paper on supervised sparse coding. It defines a generalized form of Lasso problem that equates some input variables x (e.g. the input image) to some hidden variables y (e.g. the underlying, analyzed or denoised form of the image) in least squares via two arbitrary weight matrices -- M1*x-M2*y 2 -- adding an L1 penalty on a linear mapping Omega*y and an optional L2 penalty on y itself. This formulation is general enough to cover both analysis and synthesis (generative) problems. The paper solves this system for y(x) via a proximal iteration on auxilliary variables z Omega*y -- an ADMM style method -- and it also offers fast approximation of y(x) via a network that contains an unwound, truncated ADMM with re-learned parameters along the lines of Gregor & LeCun [11]. Finally, it proposes supervised parameter (mapping matrix) learning using stochastic gradient descent over an arbitrary problem-specific loss function on y(x), and to this end it derives some explicit formulae for cost gradients in terms of sign(Omega*y) and the matrices and auxilliary vectors involved. The method is illustrated on an image super-resolution problem and a polyphonic music transcription one.
Understanding Dropout
Dropout is a relatively new algorithm for training neural networks which relies on stochastically "dropping out" neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function.
6d0f846348a856321729a2f36734d1a7-Reviews.html
This equation, and others, probably overlap quite substantially with the approaches taken in e.g. "Optimization Algorithms on Matrix Manifolds" by Absil, P., Mahony, R., Sepulchre, R. ( this reference also addresses more general gradient flows, stepsize, retraction ("projection"), and convergence issues). S. Bonnabel, "Stochastic gradient descent on Riemannian manifolds" (arXiv) A. Edelman et al. "THE GEOMETRY OF ALGORITHMS WITH ORTHOGONALITY CONSTRAINTS" - The perspective of gradient descent on matrix manifolds is, to this reviewer, a major omission. The paper could greatly benefit from a discussion as to how this slice of the literature fits in to the picture, in terms of both practical algorithms and theory. I understand that space is a constraint. But if the authors decide to ultimately write a journal version of the submission, careful, considered inclusion of the above would make for a fantastic, authoritative paper bridging all of the major points of view on the topic.
Linear Convergence with Condition Number Independent Access of Full Gradients
For smooth and strongly convex optimizations, the optimal iteration complexity of the gradient-based algorithm is O( κlog1/ǫ), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution.
Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large scale data. We apply this method to latent Dirichlet allocation in an online minibatch setting, and demonstrate that it achieves substantial performance improvements over the state of the art online variational Bayesian methods.
Estimation Optimization and Parallelism when Data is Sparse
We study stochastic optimization problems when the data is sparse, which is in a sense dual to current perspectives on high-dimensional statistical learning and optimization. We highlight both the difficulties--in terms of increased sample complexity that sparse data necessitates--and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on several medium to large-scale learning tasks.