AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Total stochastic gradient algorithms and applications in reinforcement learning

Neural Information Processing SystemsFeb-14-2020, 20:58:19 GMT

Backpropagation and the chain rule of derivatives have been prominent; however, the total derivative rule has not enjoyed the same amount of attention. In this work we show how the total derivative rule leads to an intuitive visual framework for creating gradient estimators on graphical models. In particular, previous "policy gradient theorems" are easily derived. We derive new gradient estimators based on density estimation, as well as a likelihood ratio gradient, which "jumps" to an intermediate node, not directly to the objective function. We evaluate our methods on model-based policy gradient algorithms, achieve good performance, and present evidence towards demystifying the success of the popular PILCO algorithm.

reinforcement, stochastic gradient algorithm and application, total stochastic gradient algorithm, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)

Add feedback

Variance-Reduced Stochastic Gradient Descent on Streaming Data

Jothimurugesan, Ellango, Tahmasbi, Ashraf, Gibbons, Phillip, Tirthapura, Srikanta

Neural Information Processing SystemsFeb-14-2020, 20:56:45 GMT

We present an algorithm STRSAGA for efficiently maintaining a machine learning model over data points that arrive over time, quickly updating the model as new training data is observed. We present a competitive analysis comparing the sub-optimality of the model maintained by STRSAGA with that of an offline algorithm that is given the entire data beforehand, and analyze the risk-competitiveness of STRSAGA under different arrival patterns. Our theoretical and experimental results show that the risk of STRSAGA is comparable to that of offline algorithms on a variety of input arrival patterns, and its experimental performance is significantly better than prior algorithms suited for streaming data, such as SGD and SSVRG. Papers published at the Neural Information Processing Systems Conference.

streaming data, strsaga, variance-reduced stochastic gradient descent, (3 more...)

Neural Information Processing Systems

Genre: Research Report (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

Add feedback

On Markov Chain Gradient Descent

Sun, Tao, Sun, Yuejiao, Yin, Wotao

Neural Information Processing SystemsFeb-14-2020, 20:56:42 GMT

Stochastic gradient methods are the workhorse (algorithms) of large-scale optimization problems in machine learning, signal processing, and other computational sciences and engineering. This paper studies Markov chain gradient descent, a variant of stochastic gradient descent where the random samples are taken on the trajectory of a Markov chain. Existing results of this method assume convex objectives and a reversible Markov chain and thus have their limitations. We establish new non-ergodic convergence under wider step sizes, for nonconvex problems, and for non-reversible finite-state Markov chains. Nonconvexity makes our method applicable to broader problem classes.

finite-state markov chain, markov chain, markov chain gradient descent

Neural Information Processing Systems

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Add feedback

ATOMO: Communication-efficient Learning via Atomic Sparsification

Wang, Hongyi, Sievert, Scott, Liu, Shengchao, Charles, Zachary, Papailiopoulos, Dimitris, Wright, Stephen

Neural Information Processing SystemsFeb-14-2020, 20:56:28 GMT

Distributed model training suffers from communication overheads due to frequent gradient updates transmitted between compute nodes. To mitigate these overheads, several studies propose the use of sparsified stochastic gradients. We argue that these are facets of a general sparsification method that can operate on any possible atomic decomposition. Notable examples include element-wise, singular value, and Fourier decompositions. Given a gradient, an atomic decomposition, and a sparsity budget, ATOMO gives a random unbiased sparsification of the atoms minimizing variance.

atomic sparsification, communication-efficient learning, decomposition, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)

Add feedback

Neural Proximal Gradient Descent for Compressive Imaging

Mardani, Morteza, Sun, Qingyun, Donoho, David, Papyan, Vardan, Monajemi, Hatef, Vasanawala, Shreyas, Pauly, John

Neural Information Processing SystemsFeb-14-2020, 20:43:40 GMT

Recovering high-resolution images from limited sensory data typically leads to a serious ill-posed inverse problem, demanding inversion algorithms that effectively capture the prior information. Learning a good inverse mapping from training data faces severe challenges, including: (i) scarcity of training data; (ii) need for plausible reconstructions that are physically feasible; (iii) need for fast reconstruction, especially in real-time applications. We develop a successful system solving all these challenges, using as basic architecture the repetitive application of alternating proximal and data fidelity constraints. We learn a proximal map that works well with real images based on residual networks with recurrent blocks. Extensive experiments are carried out under different settings: (a) reconstructing abdominal MRI of pediatric patients from highly undersampled k-space data and (b) super-resolving natural face images.

compressive imaging, neural proximal gradient descent, reconstruction, (3 more...)

Neural Information Processing Systems

Industry: Health & Medicine > Therapeutic Area > Pediatrics/Neonatology (0.63)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Add feedback

Fast Approximate Natural Gradient Descent in a Kronecker Factored Eigenbasis

George, Thomas, Laurent, César, Bouthillier, Xavier, Ballas, Nicolas, Vincent, Pascal

Neural Information Processing SystemsFeb-14-2020, 20:43:33 GMT

For models with many parameters, the covari- ance matrix they are based on becomes gigantic, making them inapplicable in their original form. This has motivated research into both simple diagonal approxima- tions and more sophisticated factored approximations such as KFAC (Heskes, 2000; Martens & Grosse, 2015; Grosse & Martens, 2016). In the present work we draw inspiration from both to propose a novel approximation that is provably better than KFAC and amendable to cheap partial updates. It consists in tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis, in which the diagonal approximation is likely to be more effective. Experiments show improvements over KFAC in optimization speed for several deep network architectures.

approximation, fast approximate natural gradient descent, kronecker factored eigenbasis, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Add feedback

Implicit Bias of Gradient Descent on Linear Convolutional Networks

Gunasekar, Suriya, Lee, Jason D., Soudry, Daniel, Srebro, Nati

Neural Information Processing SystemsFeb-14-2020, 20:43:04 GMT

We show that gradient descent on full-width linear convolutional networks of depth $L$ converges to a linear predictor related to the $\ell_{2/L}$ bridge penalty in the frequency domain. This is in contrast to linearly fully connected networks, where gradient descent converges to the hard margin linear SVM solution, regardless of depth. Papers published at the Neural Information Processing Systems Conference.

gradient descent, implicit bias, linear convolutional network, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

The Effect of Network Width on the Performance of Large-batch Training

Chen, Lingjiao, Wang, Hongyi, Zhao, Jinman, Papailiopoulos, Dimitris, Koutris, Paraschos

Neural Information Processing SystemsFeb-14-2020, 20:42:14 GMT

Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however it besets the convergence of the algorithm and the generalization performance. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that--for a fixed number of parameters--wider networks are more amenable to fast large-batch training compared to deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.

large-batch training, neural network, wider network, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

The Limit Points of (Optimistic) Gradient Descent in Min-Max Optimization

Daskalakis, Constantinos, Panageas, Ioannis

Neural Information Processing SystemsFeb-14-2020, 20:41:53 GMT

Motivated by applications in Optimization, Game Theory, and the training of Generative Adversarial Networks, the convergence properties of first order methods in min-max problems have received extensive study. It has been recognized that they may cycle, and there is no good understanding of their limit points when they do not. When they converge, do they converge to local min-max solutions? We characterize the limit points of two basic first order methods, namely Gradient Descent/Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA). We show that both dynamics avoid unstable critical points for almost all initializations.

critical point, gradient descent, min-max optimization, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.92)

Add feedback

Stein Variational Gradient Descent as Moment Matching

Liu, Qiang, Wang, Dilin

Neural Information Processing SystemsFeb-14-2020, 20:26:29 GMT

Stein variational gradient descent (SVGD) is a non-parametric inference algorithm that evolves a set of particles to fit a given distribution of interest. We analyze the non-asymptotic properties of SVGD, showing that there exists a set of functions, which we call the Stein matching set, whose expectations are exactly estimated by any set of particles that satisfies the fixed point equation of SVGD. This set is the image of Stein operator applied on the feature maps of the positive definite kernel used in SVGD. Our results provide a theoretical framework for analyzing the properties of SVGD with different kernels, shedding insight into optimal kernel choice. In particular, we show that SVGD with linear kernels yields exact estimation of means and variances on Gaussian distributions, while random Fourier features enable probabilistic bounds for distributional approximation.

matching, stein variational gradient descent, svgd, (4 more...)

Neural Information Processing Systems

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Add feedback