Goto

Collaborating Authors

 spiderboost


Variance Reduction with Sparse Gradients

arXiv.org Machine Learning

A BSTRACT V ariance reduction methods such as SVRG (Johnson & Zhang, 2013) and SpiderBoost (Wang et al., 2018) use a mixture of large and small batch gradients to reduce the variance of stochastic gradients. Compared to SGD (Robbins & Monro, 1951), these methods require at least double the number of operations per update to model parameters. To reduce the computational cost of these methods, we introduce a new sparsity operator: The random-top- k operator. Our operator reduces computational complexity by estimating gradient sparsity exhibited in a variety of applications by combining the top-k operator (Stich et al., 2018; Aji & Heafield, 2017) and the randomized coordinate descent operator. With this operator, large batch gradients offer an extra benefit beyond variance reduction: A reliable estimate of gradient sparsity. Theoretically, our algorithm is at least as good as the best algorithm (SpiderBoost), and further excels in performance whenever the random-top- k operator captures gradient sparsity. Empirically, our algorithm consistently outperforms SpiderBoost using various models on various tasks including image classification, natural language processing, and sparse matrix factorization. We also provide empirical evidence to support the intuition behind our algorithm via a simple gradient entropy computation, which serves to quantify gradient sparsity at every iteration. It updates the iterate x with x η f I(x), where η is the learning rate and f I(x) is the batch stochastic gradient, i.e. f I(x) 1 I null i I f i(x).


SpiderBoost: A Class of Faster Variance-reduced Algorithms for Nonconvex Optimization

arXiv.org Machine Learning

There has been extensive research on developing stochastic variance reduced methods to solve large-scale optimization problems. More recently, a novel algorithm of such a type named SPIDER has been developed in \cite{Fang2018}, which was shown to outperform existing algorithms of the same type and meet the lower bound in certain regimes. Though interesting in theory, SPIDER requires $\epsilon$-level stepsize to guarantee the convergence, and consequently runs slow in practice. This paper proposes SpiderBoost as an improved SPIDER scheme, which comes with two major advantages compared to SPIDER. First, it allows much larger stepsize without sacrificing the convergence rate, and hence runs substantially faster than SPIDER in practice. Second, it extends much more easily to proximal algorithms with guaranteed convergence for solving composite optimization problems, which appears challenging for SPIDER due to stringent requirement on per-iteration increment to guarantee its convergence. Both advantages can be attributed to the new convergence analysis we develop for SpiderBoost that allows much more flexibility for choosing algorithm parameters. As further generalization of SpiderBoost, we show that proximal SpiderBoost achieves a stochastic first-order oracle (SFO) complexity of $\mathcal{O}(\min\{n^{1/2}\epsilon^{-1},\epsilon^{-3/2}\})$ for composite optimization, which improves the existing best results by a factor of $\mathcal{O}(\min\{n^{1/6},\epsilon^{-1/6}\})$.