AITopics | Gradient Descent

Gradient Descent

SGD on Neural Networks Learns Functions of Increasing Complexity

Kalimeris, Dimitris, Kaplun, Gal, Nakkiran, Preetum, Edelman, Benjamin, Yang, Tristan, Barak, Boaz, Zhang, Haofeng

Neural Information Processing Systems, Mar-18-2020, 21:48:27 GMT

We perform an experimental study of the dynamics of Stochastic Gradient Descent (SGD) in learning deep neural networks for several real and synthetic classification tasks. We show that in the initial epochs, almost all of the performance improvement of the classifier obtained by SGD can be explained by a linear classifier. More generally, we give evidence for the hypothesis that, as iterations progress, SGD learns functions of increasing complexity. This hypothesis can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime. We also show that the linear classifier learned in the initial stages is "retained" throughout the execution even if training is continued to the point of zero training error, and complement this with a theoretical result in a simplified model.

classifier, complexity, neural network learn function, (3 more...)

Genre: Research Report (0.71)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Communication-Efficient Distributed Learning via Lazily Aggregated Quantized Gradients

Sun, Jun, Chen, Tianyi, Giannakis, Georgios, Yang, Zaiyue

Neural Information Processing Systems, Mar-18-2020, 21:47:49 GMT

The present paper develops a novel aggregated gradient approach for distributed machine learning that adaptively compresses the gradient communication. The key idea is to first quantize the computed gradients, and then skip less informative quantized gradient communications by reusing outdated gradients. Quantizing and skipping result in 'lazy' worker-server communications, which justifies the term Lazily Aggregated Quantized gradient, henceforth abbreviated as LAQ. LAQ can provably attain the same linear convergence rate as gradient descent in the strongly convex case, while effecting major savings in communication overhead, both in transmitted bits and in communication rounds. Empirically, experiments with real data corroborate a significant communication reduction compared to existing gradient- and stochastic gradient-based algorithms.
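The quantize-then-skip idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the fixed skip threshold `tau`, the toy per-worker objective 0.5*||theta - c_m||^2, and the names `quantize` and `laq_sketch` are all assumptions made for illustration. The abstract does not state the actual skipping rule, so a simple fixed threshold stands in for it here.

```python
def quantize(vec, ref, bits=8):
    """Snap vec onto a uniform grid of 2**bits - 1 intervals centred on ref."""
    radius = max(abs(v - r) for v, r in zip(vec, ref))
    if radius == 0.0:
        return list(ref)
    step = 2.0 * radius / (2 ** bits - 1)
    return [r - radius + step * round((v - r + radius) / step)
            for v, r in zip(vec, ref)]


def laq_sketch(centers, rounds=60, lr=0.1, bits=8, tau=1e-4):
    """Toy lazily aggregated quantized gradient descent.

    Worker m holds the local loss 0.5 * ||theta - centers[m]||^2 (an assumed
    toy objective). A worker uploads its quantized gradient only when it
    differs enough from the copy the server already holds; otherwise the
    server reuses the outdated copy.
    """
    dim = len(centers[0])
    theta = [0.0] * dim
    server_copy = [[0.0] * dim for _ in centers]  # last gradient received per worker
    uploads = 0
    for _ in range(rounds):
        for m, c in enumerate(centers):
            grad = [t - ci for t, ci in zip(theta, c)]
            q = quantize(grad, server_copy[m], bits)
            innovation = sum((a - b) ** 2 for a, b in zip(q, server_copy[m]))
            if innovation > tau:          # informative enough: communicate
                server_copy[m] = q
                uploads += 1
            # else: skip the upload, the server keeps the stale gradient
        aggregate = [sum(g[i] for g in server_copy) for i in range(dim)]
        theta = [t - lr * a for t, a in zip(theta, aggregate)]
    return theta, uploads


centers = [[1.0, 2.0], [3.0, -1.0], [-2.0, 0.5], [0.0, 2.5]]
theta, uploads = laq_sketch(centers)
print(theta, uploads, "uploads out of", 60 * len(centers))
```

In this toy setting the iterate ends close to the mean of the centers while the number of uploads stays below one per worker per round, which is the communication saving the abstract refers to.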

communication-efficient, lazily aggregated quantized gradient, learning, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Ultrametric Fitting by Gradient Descent

Chierchia, Giovanni, Perret, Benjamin

Neural Information Processing Systems, Mar-18-2020, 21:46:50 GMT

We study the problem of fitting an ultrametric distance to a dissimilarity graph in the context of hierarchical cluster analysis. Standard hierarchical clustering methods are specified procedurally, rather than in terms of the cost function to be optimized. We aim to overcome this limitation by presenting a general optimization framework for ultrametric fitting. Our approach consists of modeling the latter as a constrained optimization problem over the continuous space of ultrametrics. In doing so, we can leverage the simple, yet effective, idea of replacing the ultrametric constraint with a min-max operation injected directly into the cost function.
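One common reading of a "min-max operation" on a dissimilarity graph is the minimax path distance: the distance between two nodes becomes the smallest, over all connecting paths, of the largest edge weight on the path. The sketch below computes it with a Floyd-Warshall style loop. Treating this as the operation meant in the abstract is an assumption, the function name is hypothetical, and the differentiable, gradient-based formulation of the paper is not reproduced.

```python
def minimax_ultrametric(weights):
    """Return the minimax path distance matrix of a symmetric weight matrix.

    weights[i][j] is the dissimilarity of edge (i, j); use float("inf") for a
    missing edge and 0 on the diagonal. The result u satisfies the ultrametric
    inequality u[i][j] <= max(u[i][k], u[k][j]) for all i, j, k.
    """
    n = len(weights)
    u = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Best path through k costs the larger of its two halves.
                u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u


w = [[0, 1, 5],
     [1, 0, 2],
     [5, 2, 0]]
print(minimax_ultrametric(w))  # the 0-2 entry drops from 5 to 2 via node 1
```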

cost function, gradient descent, ultrametric fitting, (1 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.44)

Asymmetric Valleys: Beyond Sharp and Flat Local Minima

He, Haowei, Huang, Gao, Yuan, Yang

Neural Information Processing Systems, Mar-18-2020, 21:30:52 GMT

Despite the non-convex nature of their loss functions, deep neural networks are known to generalize well when optimized with stochastic gradient descent (SGD). Recent work conjectures that SGD with proper configuration is able to find wide and flat local minima, which are correlated with good generalization performance. In this paper, we observe that local minima of modern deep networks are more than just flat or sharp. Instead, at a local minimum there exist many asymmetric directions such that the loss increases abruptly along one side and slowly along the opposite side; we formally define such minima as asymmetric valleys. Under mild assumptions, we first prove that for asymmetric valleys, a solution biased towards the flat side generalizes better than the exact empirical minimizer.

asymmetric valley, sharp and flat local minima

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.63)

An Improved Analysis of Training Over-parameterized Deep Neural Networks

Zou, Difan, Gu, Quanquan

Neural Information Processing Systems, Mar-18-2020, 21:15:59 GMT

A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks. However, the condition on the width of the neural network to ensure the global convergence is very stringent, which is often a high-degree polynomial in the training sample size $n$ (e.g., $O(n^{24})$). In this paper, we provide an improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which only requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters. The main technical contributions of our analysis include (a) a tighter gradient lower bound that leads to a faster convergence of the algorithm, and (b) a sharper characterization of the trajectory length of the algorithm. By specializing our result to two-layer (i.e., one-hidden-layer) neural networks, it also provides a milder over-parameterization condition than the best-known result in prior work. Papers published at the Neural Information Processing Systems Conference.

artificial intelligence, machine learning, training over-parameterized deep neural network, (7 more...)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)

SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points

Li, Zhize

Neural Information Processing Systems, Mar-18-2020, 21:01:55 GMT

We analyze stochastic gradient algorithms for optimizing nonconvex problems. In particular, our goal is to find local minima (second-order stationary points) instead of just finding first-order stationary points which may be some bad unstable saddle points. We show that a simple perturbed version of stochastic recursive gradient descent algorithm (called SSRGD) can find an $(\epsilon,\delta)$-second-order stationary point with $\widetilde{O}(\sqrt{n}/\epsilon^2 + \sqrt{n}/\delta^4 + n/\delta^3)$ stochastic gradient complexity for nonconvex finite-sum problems. As a by-product, SSRGD finds an $\epsilon$-first-order stationary point with $O(n + \sqrt{n}/\epsilon^2)$ stochastic gradients. These results are almost optimal since Fang et al. [2018] provided a lower bound $\Omega(\sqrt{n}/\epsilon^2)$ for finding even just an $\epsilon$-first-order stationary point.
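A rough sketch of what "a perturbed version of stochastic recursive gradient descent" can look like is given below. It is an illustration under assumptions, not the SSRGD procedure from the paper: the toy finite sum f_i(x, y) = x^4/4 - x^2/2 + (y - a_i)^2/2 (saddle at the origin when the a_i average to zero), the epoch length, the gradient threshold, the perturbation radius, and the cooldown between perturbations are all hypothetical choices.

```python
import math
import random


def grad_i(point, a):
    """Gradient of the toy component f_i(x, y) = x^4/4 - x^2/2 + (y - a)^2/2."""
    x, y = point
    return (x ** 3 - x, y - a)


def full_grad(point, offsets):
    grads = [grad_i(point, a) for a in offsets]
    n = len(offsets)
    return (sum(g[0] for g in grads) / n, sum(g[1] for g in grads) / n)


def perturbed_recursive_gd(offsets, start, steps=400, lr=0.1, epoch=5,
                           g_thresh=1e-3, radius=1e-2, cooldown=50, seed=0):
    """Recursive gradient estimator plus a random kick when the gradient is tiny."""
    rng = random.Random(seed)
    x = start
    v = full_grad(x, offsets)
    last_kick = -cooldown
    for t in range(steps):
        if t % epoch == 0:
            v = full_grad(x, offsets)          # periodic full-gradient refresh
            if math.hypot(*v) < g_thresh and t - last_kick >= cooldown:
                # Near a stationary point: perturb to leave a possible saddle.
                x = (x[0] + rng.uniform(-radius, radius),
                     x[1] + rng.uniform(-radius, radius))
                v = full_grad(x, offsets)
                last_kick = t
        x_new = (x[0] - lr * v[0], x[1] - lr * v[1])
        a = rng.choice(offsets)                # one sampled component
        g_new, g_old = grad_i(x_new, a), grad_i(x, a)
        # Recursive update: v_t = grad_i(x_t) - grad_i(x_{t-1}) + v_{t-1}
        v = (v[0] + g_new[0] - g_old[0], v[1] + g_new[1] - g_old[1])
        x = x_new
    return x


offsets = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(perturbed_recursive_gd(offsets, start=(0.0, 1.0)))              # escapes toward x = +1 or -1
print(perturbed_recursive_gd(offsets, start=(0.0, 1.0), radius=0.0))  # no kick: stays on the saddle axis
```

Started on the axis x = 0, the unperturbed run converges to the saddle at the origin, while the perturbed run leaves it and settles near one of the two local minima.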

algorithm, simple stochastic recursive gradient descent, stationary point, (8 more...)

Industry:

Transportation > Passenger (0.40)
Transportation > Ground > Road (0.40)
Automobiles & Trucks > Manufacturer (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence

He, Fengxiang, Liu, Tongliang, Tao, Dacheng

Neural Information Processing Systems, Mar-18-2020, 20:47:07 GMT

Deep neural networks have achieved dramatic success with stochastic gradient descent (SGD) as the optimization method. However, it is still not clear how to tune hyper-parameters, especially batch size and learning rate, to ensure good generalization. This paper reports both theoretical and empirical evidence for a training strategy: keep the ratio of batch size to learning rate from becoming too large in order to achieve good generalization. Specifically, we prove a PAC-Bayes generalization bound for neural networks trained by SGD, which has a positive correlation with the ratio of batch size to learning rate. This correlation builds the theoretical foundation of the training strategy. Furthermore, we conduct a large-scale experiment to verify the correlation and training strategy.

batch size and learning rate, theoretical and empirical evidence, training strategy, (5 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)

First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Nguyen, Thanh Huy, Simsekli, Umut, Gurbuzbalaban, Mert, Richard, Gaël

Neural Information Processing Systems, Mar-18-2020, 20:30:32 GMT

Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using $\alpha$-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a Lévy motion, and the metastability results for this SDE can then be used for illuminating the behavior of SGD, especially in terms of preferring 'wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might admit a significantly different behavior than its continuous-time limit.
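To make the "discretization of an SDE driven by a Lévy motion" concrete, the sketch below simulates a one-dimensional recursion x_{k+1} = x_k - lr * f'(x_k) + noise_scale * lr^(1/alpha) * xi_k with symmetric alpha-stable xi_k, and records the first step at which the iterate leaves an interval around the minimum. The quadratic objective f(x) = x^2/2, the interval, the parameter values, and both function names are assumptions for illustration; the noise is drawn with the Chambers-Mallows-Stuck method.

```python
import math
import random


def sample_symmetric_alpha_stable(alpha, rng):
    """Chambers-Mallows-Stuck sampler for a symmetric alpha-stable variable, 0 < alpha <= 2."""
    u = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    if alpha == 1.0:
        return math.tan(u)  # Cauchy case
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))


def first_exit_time(alpha=1.5, lr=0.01, noise_scale=0.5, half_width=1.0,
                    max_steps=10_000, seed=0):
    """Steps until |x| > half_width for noisy gradient descent on f(x) = x^2 / 2.

    Returns max_steps if the iterate never leaves the interval.
    """
    rng = random.Random(seed)
    x = 0.0
    for k in range(1, max_steps + 1):
        xi = sample_symmetric_alpha_stable(alpha, rng)
        x = x - lr * x + noise_scale * lr ** (1.0 / alpha) * xi
        if abs(x) > half_width:
            return k
    return max_steps


print(first_exit_time(alpha=1.5))  # heavy-tailed noise
print(first_exit_time(alpha=2.0))  # alpha = 2 is the Gaussian case
```

Averaging the returned value over many seeds for different `alpha` and `half_width` gives an empirical view of how exit times depend on the tail index and the width of the basin.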

exit time analysis, heavy-tailed gradient noise, stochastic gradient descent, (3 more...)

Genre: Research Report (0.37)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers

Wu, Liwei, Li, Shuqing, Hsieh, Cho-Jui, Sharpnack, James L.

Neural Information Processing Systems, Mar-18-2020, 20:18:50 GMT

In deep neural nets, lower level embedding layers account for a large portion of the total number of parameters. Tikhonov regularization, graph-based regularization, and hard parameter sharing are approaches that introduce explicit biases into training in the hope of reducing statistical complexity. Alternatively, we propose stochastically shared embeddings (SSE), a data-driven approach to regularizing embedding layers, which stochastically transitions between embeddings during stochastic gradient descent (SGD). Because SSE integrates seamlessly with existing SGD algorithms, it can be used with only minor modifications when training large scale neural networks. We develop two versions of SSE: SSE-Graph using knowledge graphs of embeddings; SSE-SE using no prior information.
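The idea of stochastically transitioning between embeddings without prior information can be sketched as an index-swapping step applied to each training batch before the embedding lookup. This is a minimal illustration: the swap probability `p`, the uniform choice of the replacement index, and the function name are assumptions, and the abstract does not give the exact transition rule used by SSE-SE.

```python
import random


def sse_se_indices(indices, vocab_size, p=0.1, rng=None):
    """Replace each embedding index by a uniformly random one with probability p.

    Intended for training batches only; evaluation would use the original
    indices unchanged.
    """
    rng = rng or random.Random(0)
    swapped = []
    for idx in indices:
        if rng.random() < p:
            swapped.append(rng.randrange(vocab_size))  # transition to another embedding
        else:
            swapped.append(idx)                        # keep the original embedding
    return swapped


batch = [3, 1, 4, 1, 5, 9, 2, 6]
print(sse_se_indices(batch, vocab_size=10, p=0.25))
```

Because the change happens on the indices fed to the embedding table, the rest of an SGD training loop can stay as it is, which matches the abstract's remark that only minor modifications are needed.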

data-driven regularization, embedding layer, stochastic shared embedding, (2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.64)

NeCPD: An Online Tensor Decomposition with Optimal Stochastic Gradient Descent

Anaissi, Ali, Suleiman, Basem, Zandavi, Seid Miad

arXiv.org Machine Learning, Mar-18-2020

Multi-way data analysis has become an essential tool for capturing underlying structures in higher-order datasets stored in a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \dots \times I_N}$. CANDECOMP/PARAFAC (CP) decomposition has been extensively studied and applied to approximate $\mathcal{X}$ by $N$ loading matrices $A^{(1)}, \dots, A^{(N)}$, where $N$ represents the order of the tensor. We propose a new efficient CP decomposition solver named NeCPD for non-convex problems in multi-way online data, based on the stochastic gradient descent (SGD) algorithm. SGD is very useful in the online setting since it allows us to update $\mathcal{X}^{(t+1)}$ in a single step. In terms of global convergence, it is well known that SGD gets stuck in many saddle points when it deals with non-convex problems. We study the Hessian matrix to identify these saddle points, and then try to escape them using the perturbation approach, which adds a little noise to the gradient update step. We further apply Nesterov's Accelerated Gradient (NAG) method in the SGD algorithm to optimally accelerate the convergence rate and compensate for the Hessian computational delay time per epoch. Experimental evaluation in the field of structural health monitoring using laboratory-based and real-life structural datasets shows that our method provides more accurate results compared with existing online tensor analysis methods.

algorithm, decomposition, matrix, (14 more...)

2003.08844

Country:

Oceania > Australia > New South Wales (0.04)
North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
Africa > Senegal > Kolda Region > Kolda (0.04)
Europe > Denmark > North Jutland > Aalborg (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
