AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Exponentially convergent stochastic k-PCA without variance reduction

Neural Information Processing SystemsMar-19-2020, 01:46:09 GMT

We show, both theoretically and empirically, that the algorithm naturally adapts to data low-rankness and converges exponentially fast to the ground-truth principal subspace. Notably, our result suggests that despite various recent efforts to accelerate the convergence of stochastic-gradient based methods by adding a O(n)-time variance reduction step, for the k- PCA problem, a truly online SGD variant suffices to achieve exponential convergence on intrinsically low-rank data. Papers published at the Neural Information Processing Systems Conference.

exponentially convergent stochastic k-pca, krasulina, variance reduction, (1 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.71)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)

Add feedback

Fast and Accurate Stochastic Gradient Estimation

Chen, Beidi, Xu, Yingchen, Shrivastava, Anshumali

Neural Information Processing SystemsMar-19-2020, 01:45:52 GMT

Stochastic Gradient Descent or SGD is the most popular optimization algorithm for large-scale problems. SGD estimates the gradient by uniform sampling with sample size one. There have been several other works that suggest faster epoch-wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is more than calculating the full gradient itself, which we call the chicken-and-the-egg loop. As a result, the false impression of faster convergence in iterations, in reality, leads to slower convergence in time.

accurate stochastic gradient estimation, convergence, stochastic gradient descent, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.68)

Add feedback

Dimension-Free Bounds for Low-Precision Training

Li, Zheng, Sa, Christopher M. De

Neural Information Processing SystemsMar-19-2020, 01:30:38 GMT

Low-precision training is a promising way of decreasing the time and energy cost of training machine learning models. Previous work has analyzed low-precision training algorithms, such as low-precision stochastic gradient descent, and derived theoretical bounds on their convergence rates. These bounds tend to depend on the dimension of the model $d$ in that the number of bits needed to achieve a particular error bound increases as $d$ increases. In this paper, we derive new bounds for low-precision training algorithms that do not contain the dimension $d$, which lets us better understand what affects the convergence of these algorithms as parameters scale. Our methods also generalize naturally to let us prove new convergence bounds on low-precision training with other quantization schemes, such as low-precision floating-point computation and logarithmic quantization.

dimension-free bound, low-precision training, low-precision training algorithm, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.66)

Add feedback

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

Li, Yuanzhi, Wei, Colin, Ma, Tengyu

Neural Information Processing SystemsMar-19-2020, 01:18:22 GMT

Stochastic gradient descent with a large initial learning rate is widely used for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two layer network trained with large initial learning rate and annealing provably generalizes better than the same network trained with a small learning rate from the start. The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small learning rate model first memorizes low-noise, hard-to-fit patterns, it generalizes worse on hard-to-generalize, easier-to-fit patterns than its large learning rate counterpart. This concept translates to a larger-scale setting: we demonstrate that one can add a small patch to CIFAR-10 images that is immediately memorizable by a model with small initial learning rate, but ignored by the model with large learning rate until after annealing.

learning rate, regularization effect, training neural network, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Add feedback

Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

Cao, Yuan, Gu, Quanquan

Neural Information Processing SystemsMar-19-2020, 01:02:26 GMT

We study the training and generalization of deep neural networks (DNNs) in the over-parameterized regime, where the network width (i.e., number of hidden nodes per layer) is much larger than the number of training data points. We show that, the expected $0$-$1$ loss of a wide enough ReLU network trained with stochastic gradient descent (SGD) and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which we call a \textit{neural tangent random feature} (NTRF) model. For data distributions that can be classified by NTRF model with sufficiently small error, our result yields a generalization error bound in the order of $\tilde{\mathcal{O}}(n {-1/2})$ that is independent of the network width. Our result is more general and sharper than many existing generalization error bounds for over-parameterized neural networks. In addition, we establish a strong connection between our generalization error bound and the neural tangent kernel (NTK) proposed in recent work.

generalization error, stochastic gradient descent, wide and deep neural network, (2 more...)

Neural Information Processing Systems

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Efficient Convex Relaxations for Streaming PCA

Arora, Raman, Marinov, Teodor Vanislavov

Neural Information Processing SystemsMar-19-2020, 00:49:14 GMT

We revisit two algorithms, matrix stochastic gradient (MSG) and $\ell_2$-regularized MSG (RMSG), that are instances of stochastic gradient descent (SGD) on a convex relaxation to principal component analysis (PCA). These algorithms have been shown to outperform Oja's algorithm, empirically, in terms of the iteration complexity, and to have runtime comparable with Oja's. However, these findings are not supported by existing theoretical results. While the iteration complexity bound for $\ell_2$-RMSG was recently shown to match that of Oja's algorithm, its theoretical efficiency was left as an open problem. In this work, we give improved bounds on per iteration cost of mini-batched variants of both MSG and $\ell_2$-RMSG and arrive at an algorithm with total computational complexity matching that of Oja's algorithm.

algorithm, efficient convex relaxation, streaming pca, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.93)

Add feedback

Poincaré Recurrence, Cycles and Spurious Equilibria in Gradient-Descent-Ascent for Non-Convex Non-Concave Zero-Sum Games

Vlatakis-Gkaragkounis, Emmanouil-Vasileios, Flokas, Lampros, Piliouras, Georgios

Neural Information Processing SystemsMar-19-2020, 00:48:59 GMT

We study a wide class of non-convex non-concave min-max games that generalizes over standard bilinear zero-sum games. In this class, players control the inputs of a smooth function whose output is being applied to a bilinear zero-sum game. This class of games is motivated by the indirect nature of the competition in Generative Adversarial Networks, where players control the parameters of a neural network while the actual competition happens between the distributions that the generator and discriminator capture. We establish theoretically, that depending on the specific instance of the problem gradient-descent-ascent dynamics can exhibit a variety of behaviors antithetical to convergence to the game theoretically meaningful min-max solution. Specifically, different forms of recurrent behavior (including periodicity and Poincar\'{e} recurrence) are possible as well as convergence to spurious (non-min-max) equilibria for a positive measure of initial conditions.

cycle and spurious equilibria, gradient-descent-ascent, non-convex non-concave zero-sum game, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.64)

Add feedback

On Fenchel Mini-Max Learning

Tao, Chenyang, Chen, Liqun, Dai, Shuyang, Chen, Junya, Bai, Ke, Wang, Dong, Feng, Jianfeng, Lu, Wenlian, Bobashev, Georgiy, Carin, Lawrence

Neural Information Processing SystemsMar-19-2020, 00:48:51 GMT

Inference, estimation, sampling and likelihood evaluation are four primary goals of probabilistic modeling. Practical considerations often force modeling approaches to make compromises between these objectives. We present a novel probabilistic learning framework, called Fenchel Mini-Max Learning (FML), that accommodates all four desiderata in a flexible and scalable manner. Our derivation is rooted in classical maximum likelihood estimation, and it overcomes a longstanding challenge that prevents unbiased estimation of unnormalized statistical models. By reformulating MLE as a mini-max game, FML enjoys an unbiased training objective that (i) does not explicitly involve the intractable normalizing constant and (ii) is directly amendable to stochastic gradient descent optimization.

Add feedback

Can SGD Learn Recurrent Neural Networks with Provable Generalization?

Allen-Zhu, Zeyuan, Li, Yuanzhi

Neural Information Processing SystemsMar-19-2020, 00:48:20 GMT

Recurrent Neural Networks (RNNs) are among the most popular models in sequential data analysis. Yet, in the foundational PAC learning language, what concept class can it learn? Moreover, how can the same recurrent unit simultaneously learn functions from different input tokens to different output tokens, without affecting each other? In this paper, we show using the vanilla stochastic gradient descent (SGD), RNN can actually learn some notable concept class \emph{efficiently}, meaning that both time and sample complexity scale \emph{polynomially} in the input length (or almost polynomially, depending on the concept). This concept class at least includes functions where each output token is generated from inputs of earlier tokens using a smooth two-layer neural network. Papers published at the Neural Information Processing Systems Conference.

provable generalization, recurrent neural network, sgd learn recurrent neural network, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.66)

Add feedback

Stochastic Variance Reduced Primal Dual Algorithms for Empirical Composition Optimization

Devraj, Adithya M, Chen, Jianshu

Neural Information Processing SystemsMar-19-2020, 00:32:46 GMT

We consider a generic empirical composition optimization problem, where there are empirical averages present both outside and inside nonlinear loss functions. Such a problem is of interest in various machine learning applications, and cannot be directly solved by standard methods such as stochastic gradient descent (SGD). We take a novel approach to solving this problem by reformulating the original minimization objective into an equivalent min-max objective, which brings out all the empirical averages that are originally inside the nonlinear loss functions. We exploit the rich structures of the reformulated problem and develop a stochastic primal-dual algorithms, SVRPDA-I, to solve the problem efficiently. We carry out extensive theoretical analysis of the proposed algorithm, obtaining the convergence rate, the total computation complexity and the storage complexity.

algorithm, stochastic variance reduced primal, variance reduced primal dual algorithm, (3 more...)

Neural Information Processing Systems

Genre: Research Report (0.62)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)

Add feedback