AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

The promises and pitfalls of Stochastic Gradient Langevin Dynamics

Brosse, Nicolas, Durmus, Alain, Moulines, Eric

Neural Information Processing SystemsFeb-14-2020, 20:12:49 GMT

Stochastic Gradient Langevin Dynamics (SGLD) has emerged as a key MCMC algorithm for Bayesian learning from large scale datasets. While SGLD with decreasing step sizes converges weakly to the posterior distribution, the algorithm is often used with a constant step size in practice and has demonstrated spectacular successes in machine learning tasks. The current practice is to set the step size inversely proportional to N where N is the number of training samples. As N becomes large, we show that the SGLD algorithm has an invariant probability measure which significantly departs from the target posterior and behaves like as Stochastic Gradient Descent (SGD). This difference is inherently due to the high variance of the stochastic gradients.

algorithm, promise and pitfall, stochastic gradient langevin dynamic, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data

Li, Yuanzhi, Liang, Yingyu

Neural Information Processing SystemsFeb-14-2020, 20:12:14 GMT

Neural networks have many successful applications, while much less theoretical understanding has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization. In the overparameterized setting, when the data comes from mixtures of well-separated distributions, we prove that SGD learns a network with a small generalization error, albeit the network has enough capacity to fit arbitrary labels. Furthermore, the analysis provides interesting insights into several aspects of learning neural networks and can be verified based on empirical studies on synthetic data and on the MNIST dataset. Papers published at the Neural Information Processing Systems Conference.

learning overparameterized neural network, stochastic gradient descent, structured data

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

Pillaud-Vivien, Loucas, Rudi, Alessandro, Bach, Francis

Neural Information Processing SystemsFeb-14-2020, 20:11:59 GMT

We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.

hard learning problem, statistical optimality, stochastic gradient descent, (5 more...)

Neural Information Processing Systems

Genre: Research Report (0.44)

Industry: Education > Focused Education > Special Education (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Stochastic Expectation Maximization with Variance Reduction

Chen, Jianfei, Zhu, Jun, Teh, Yee Whye, Zhang, Tong

Neural Information Processing SystemsFeb-14-2020, 19:58:20 GMT

Expectation-Maximization (EM) is a popular tool for learning latent variable models, but the vanilla batch EM does not scale to large data sets because the whole data set is needed at every E-step. Stochastic Expectation Maximization (sEM) reduces the cost of E-step by stochastic approximation. However, sEM has a slower asymptotic convergence rate than batch EM, and requires a decreasing sequence of step sizes, which is difficult to tune. In this paper, we propose a variance reduced stochastic EM (sEM-vr) algorithm inspired by variance reduced stochastic gradient descent algorithms. We show that sEM-vr has the same exponential asymptotic convergence rate as batch EM.

algorithm, stochastic expectation maximization, variance reduction, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Add feedback

cpSGD: Communication-efficient and differentially-private distributed SGD

Agarwal, Naman, Suresh, Ananda Theertha, Yu, Felix Xinnan X., Kumar, Sanjiv, McMahan, Brendan

Neural Information Processing SystemsFeb-14-2020, 19:56:09 GMT

Distributed stochastic gradient descent is an important subroutine in distributed learning. A setting of particular interest is when the clients are mobile devices, where two important concerns are communication efficiency and the privacy of the clients. Several recent works have focused on reducing the communication cost or introducing privacy guarantees, but none of the proposed communication efficient methods are known to be privacy preserving and none of the known privacy mechanisms are known to be communication efficient. To this end, we study algorithms that achieve both communication efficiency and differential privacy. For $d$ variables and $n \approx d$ clients, the proposed method uses $\cO(\log \log(nd))$ bits of communication per client per coordinate and ensures constant privacy.

communication-efficient and differentially-private, mechanism, privacy, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Add feedback

Inference in Deep Gaussian Processes using Stochastic Gradient Hamiltonian Monte Carlo

Havasi, Marton, Hernández-Lobato, José Miguel, Murillo-Fuentes, Juan José

Neural Information Processing SystemsFeb-14-2020, 19:55:53 GMT

Deep Gaussian Processes (DGPs) are hierarchical generalizations of Gaussian Processes that combine well calibrated uncertainty estimates with the high flexibility of multilayer models. One of the biggest challenges with these models is that exact inference is intractable. The current state-of-the-art inference method, Variational Inference (VI), employs a Gaussian approximation to the posterior distribution. This can be a potentially poor unimodal approximation of the generally multimodal posterior. In this work, we provide evidence for the non-Gaussian nature of the posterior and we apply the Stochastic Gradient Hamiltonian Monte Carlo method to generate samples.

deep gaussian process, inference, stochastic gradient hamiltonian monte carlo, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.66)

Add feedback

Stochastic Mirror Descent in Variationally Coherent Optimization Problems

Zhou, Zhengyuan, Mertikopoulos, Panayotis, Bambos, Nicholas, Boyd, Stephen, Glynn, Peter W.

Neural Information Processing SystemsFeb-14-2020, 19:43:01 GMT

In this paper, we examine a class of non-convex stochastic optimization problems which we call variationally coherent, and which properly includes pseudo-/quasiconvex and star-convex optimization problems. To solve such problems, we focus on the widely used stochastic mirror descent (SMD) family of algorithms (which contains stochastic gradient descent as a special case), and we show that the last iterate of SMD converges to the problem's solution set with probability 1. This result contributes to the landscape of non-convex stochastic optimization by clarifying that neither pseudo-/quasi-convexity nor star-convexity is essential for (almost sure) global convergence; rather, variational coherence, a much weaker requirement, suffices. Characterization of convergence rates for the subclass of strongly variationally coherent optimization problems as well as simulation results are also presented. Papers published at the Neural Information Processing Systems Conference.

stochastic mirror descent, variationally coherent optimization problem

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Add feedback

Stochastic Chebyshev Gradient Descent for Spectral Optimization

Han, Insu, Avron, Haim, Shin, Jinwoo

Neural Information Processing SystemsFeb-14-2020, 19:42:43 GMT

A large class of machine learning techniques requires the solution of optimization problems involving spectral functions of parametric matrices, e.g. Unfortunately, computing the gradient of a spectral function is generally of cubic complexity, as such gradient descent methods are rather expensive for optimizing objectives involving the spectral function. Thus, one naturally turns to stochastic gradient methods in hope that they will provide a way to reduce or altogether avoid the computation of full gradients. However, here a new challenge appears: there is no straightforward way to compute unbiased stochastic gradients for spectral functions. In this paper, we develop unbiased stochastic gradients for spectral-sums, an important subclass of spectral functions.

artificial intelligence, machine learning, spectral function, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Training Deep Models Faster with Robust, Approximate Importance Sampling

Johnson, Tyler B., Guestrin, Carlos

Neural Information Processing SystemsFeb-14-2020, 19:41:05 GMT

In practice, the cost of computing importances greatly limits the impact of importance sampling. We propose a robust, approximate importance sampling procedure (RAIS) for stochastic gradient de- scent. By approximating the ideal sampling distribution using robust optimization, RAIS provides much of the benefit of exact importance sampling with drastically reduced overhead. Empirically, we find RAIS-SGD and standard SGD follow similar learning curves, but RAIS moves faster through these paths, achieving speed-ups of at least 20% and sometimes much more. Papers published at the Neural Information Processing Systems Conference.

approximate importance sampling, robust

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.59)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.53)

Add feedback

Stochastic Submodular Maximization: The Case of Coverage Functions

Karimi, Mohammad, Lucic, Mario, Hassani, Hamed, Krause, Andreas

Neural Information Processing SystemsFeb-14-2020, 19:40:55 GMT

Stochastic optimization of continuous objectives is at the heart of modern machine learning. However, many important problems are of discrete nature and often involve submodular objectives. We seek to unleash the power of stochastic continuous optimization, namely stochastic gradient descent and its variants, to such discrete problems. We first introduce the problem of stochastic submodular optimization, where one needs to optimize a submodular objective which is given as an expectation. Our model captures situations where the discrete objective arises as an empirical risk (e.g., in the case of exemplar-based clustering), or is given as an explicit stochastic model (e.g., in the case of influence maximization in social networks).

coverage function, optimization, stochastic submodular maximization, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Add feedback