Goto

Collaborating Authors

 svrg


AMore on the background

Neural Information Processing Systems

A.1 SVRG and SCSG Here we provide the pseudocode for SVRG (Algorithm 2) and SCSG (Algorithm 3) seen in Lei et al. [35]. The idea of SVRG (Algorithm 2) is to reuses past full gradient computations (line 3) to reduce the variance of the current stochastic gradient estimate (line 7) before the parameter update (line 8). Note that N = 1 corresponds to a GD step (i.e., v SVRG achieves linear convergence O(1/T) using the semi-stochastic gradient. The key difference is that SCSG (Algorithm 3) considers a sequence of time-varying batch sizes (Bt and bt) and employs geometric sampling to generate the number of parameter update steps Nt in each iteration (line 6), instead of fixing the batch sizes and the number of updates as done in SVRG. Particularly when finding an -approximate solution (Definition 1) for optimizing smooth non-convex objectives, Lei et al. [35] proves that SCSG is never worse than SVRG in convergence rate and significantly outperforms SVRG when the requiredis small.


Without-Replacement Sampling for Stochastic Gradient Methods Ohad Shamir Department of Computer Science and Applied Mathematics Weizmann Institute of Science Rehovot, Israel ohad.shamir@weizmann.ac.il

Neural Information Processing Systems

Stochastic gradient methods for machine learning and optimization problems are usually analyzed assuming data points are sampled with replacement. In contrast, sampling without replacement is far less understood, yet in practice it is very common, often easier to implement, and usually performs better. In this paper, we provide competitive convergence guarantees for without-replacement sampling under several scenarios, focusing on the natural regime of few passes over the data. Moreover, we describe a useful application of these results in the context of distributed optimization with randomly-partitioned data, yielding a nearly-optimal algorithm for regularized least squares (in terms of both communication complexity and runtime complexity) under broad parameter regimes. Our proof techniques combine ideas from stochastic optimization, adversarial online learning and transductive learning theory, and can potentially be applied to other stochastic optimization and learning problems.


Stochastic Variance Reduction Methods for Saddle-Point Problems

Neural Information Processing Systems

We consider convex-concave saddle-point problems where the objective functions may be split in many components, and extend recent stochastic variance reduction methods (such as SVRG or SAGA) to provide the first large-scale linearly convergent algorithms for this class of problems which are common in machine learning. While the algorithmic extension is straightforward, it comes with challenges and opportunities: (a) the convex minimization analysis does not apply and we use the notion of monotone operators to prove convergence, showing in particular that the same algorithm applies to a larger class of problems, such as variational inequalities, (b) there are two notions of splits, in terms of functions, or in terms of partial derivatives, (c) the split does need to be done with convex-concave terms, (d) non-uniform sampling is key to an efficient algorithm, both in theory and practice, and (e) these incremental algorithms can be easily accelerated using a simple extension of the "catalyst" framework, leading to an algorithm which is always superior to accelerated batch algorithms.


Stochastic Nested Variance Reduction for Nonconvex Optimization

Neural Information Processing Systems

We study finite-sum nonconvex optimization problems, where the objective function is an average of $n$ nonconvex functions. We propose a new stochastic gradient descent algorithm based on nested variance reduction. Compared with conventional stochastic variance reduced gradient (SVRG) algorithm that uses two reference points to construct a semi-stochastic gradient with diminishing variance in each iteration, our algorithm uses $K+1$ nested reference points to build a semi-stochastic gradient to further reduce its variance in each iteration.




On the Ineffectiveness of Variance Reduced Optimization for Deep Learning

Neural Information Processing Systems

SVR methods use control variates to reduce the variance of the traditional stochastic gradient descent (SGD) estimate f0i(w) of the full gradient f0(w). Control variates are a classical technique for reducing the variance of a stochastic quantity without introducing bias. Say we have some random variable X.



SVRG and Beyond via Posterior Correction

arXiv.org Artificial Intelligence

Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed-up training by using gradient corrections, but have seen limited success in deep learning. Here, we show surprising new foundational connections of SVRG to a recently proposed Bayesian method called posterior correction. Specifically, we show that SVRG is recovered as a special case of posterior correction over the isotropic-Gaussian family, while novel extensions are automatically obtained by using more flexible exponential families. We derive two new SVRG variants by using Gaussian families: First, a Newton-like variant that employs novel Hessian corrections, and second, an Adam-like extension that improves pretraining and finetuning of Transformer language models. This is the first work to connect SVRG to Bayes and use it to boost variational training for deep networks.


First-Order Adaptive Sample Size Methods to Reduce Complexity of Empirical Risk Minimization

Neural Information Processing Systems

This paper studies empirical risk minimization (ERM) problems for large-scale datasets and incorporates the idea of adaptive sample size methods to improve the guaranteed convergence bounds for first-order stochastic and deterministic methods. In contrast to traditional methods that attempt to solve the ERM problem corresponding to the full dataset directly, adaptive sample size schemes start with a small number of samples and solve the corresponding ERM problem to its statistical accuracy. The sample size is then grown geometrically - e.g., scaling by a factor of two - and use the solution of the previous ERM as a warm start for the new ERM. Theoretical analyses show that the use of adaptive sample size methods reduces the overall computational cost of achieving the statistical accuracy of the whole dataset for a broad range of deterministic and stochastic first-order methods. The gains are specific to the choice of method. When particularized to, e.g., accelerated gradient descent and stochastic variance reduce gradient, the computational cost advantage is a logarithm of the number of training samples. Numerical experiments on various datasets confirm theoretical claims and showcase the gains of using the proposed adaptive sample size scheme.