Goto

Collaborating Authors

 Gradient Descent


On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization

arXiv.org Machine Learning

Stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its variants have been widely used in training deep neural networks. Among those variants, adaptive gradient methods (AdaGrad) (Duchi et al., 2011; McMahan and Streeter, 2010), which scale each coordinate of the gradient by a function of past gradients, can achieve better performance than vanilla SGD in practice when the gradients are sparse. An intuitive explanation for the success of AdaGrad is that it automatically adjusts the learning rate for each feature based on the partial gradient, which accelerates the convergence. However, AdaGrad was later found to demonstrate degraded performance especially in cases where the loss function is nonconvex or the gradient is dense, due to rapid decay of learning rate.


Frank-Wolfe Style Algorithms for Large Scale Optimization

arXiv.org Machine Learning

We introduce a few variants on Frank-Wolfe style algorithms suitable for large scale optimization. We show how to modify the standard Frank-Wolfe algorithm using stochastic gradients, approximate subproblem solutions, and sketched decision variables in order to scale to enormous problems while preserving (up to constants) the optimal convergence rate $\mathcal{O}(\frac{1}{k})$.


Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization

arXiv.org Machine Learning

Neural networks with ReLU activations have achieved great empirical success in various domains. However, existing results for learning ReLU networks either pose assumptions on the underlying data distribution being e.g. Gaussian, or require the network size and/or training size to be sufficiently large. In this context, the problem of learning a two-layer ReLU network is approached in a binary classification setting, where the data are linearly separable and a hinge loss criterion is adopted. Leveraging the power of random noise, this contribution presents a novel stochastic gradient descent (SGD) algorithm, which can provably train any single-hidden-layer ReLU network to attain global optimality, despite the presence of infinitely many bad local minima and saddle points in general. This result is the first of its kind, requiring no assumptions on the data distribution, training/network size, or initialization. Convergence of the resultant iterative algorithm to a global minimum is analyzed by establishing both an upper bound and a lower bound on the number of effective (non-zero) updates to be performed. Furthermore, generalization guarantees are developed for ReLU networks trained with the novel SGD. These guarantees highlight a fundamental difference (at least in the worst case) between learning a ReLU network as well as a leaky ReLU network in terms of sample complexity. Numerical tests using synthetic data and real images validate the effectiveness of the algorithm and the practical merits of the theory.


Discrete gradient descent differs qualitatively from gradient flow

arXiv.org Machine Learning

We consider gradient descent on functions of the form $L_1 = |f|$ and $L_2 = f^2$, where $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is any smooth function with 0 as a regular value. We show that gradient descent implemented with a discrete step size $\tau$ behaves qualitatively differently from continuous gradient descent. We show that over long time scales, continuous and discrete gradient descent on $L_1$ find different minima of $L_1$, and we can characterize the difference - the minima that tend to be found by discrete gradient descent lie in a secondary critical submanifold $M' \subset M$, the locus within $M$ where the function $K=|\nabla f|^2 \big|_M$ is minimized. In this paper, we explain this behavior. We also study the more subtle behavior of discrete gradient descent on $L_2$.


Understanding training and generalization in deep learning by Fourier analysis

arXiv.org Machine Learning

Background: It is still an open research area to theoretically understand why Deep Neural Networks (DNNs)-- equipped with many more parameters than training data and trained by (stochastic) gradient-based methods-- often achieve remarkably low generalization error. Contribution: We study DNN training by Fourier analysis. Our theoretical framework explains: i) DNN with (stochastic) gradient-based methods endows low-frequency components of the target function with a higher priority during the training; ii) Small initialization leads to good generalization ability of DNN while preserving the DNN's ability of fitting any function. These results are further confirmed by experiments of DNNs fitting the following datasets, i.e., natural images, one-dimensional functions and MNIST dataset.


Gradient Descent: All You Need to Know – Hacker Noon

#artificialintelligence

What's the one algorithm that's used in almost every Machine Learning model? There are a few variations of the algorithm but this, essentially, is how any ML model learns. Without this, ML wouldn't be where it is right now. In this post, I will be explaining Gradient Descent with a little bit of math. Honestly, GD(Gradient Descent) doesn't inherently involve a lot of math(I'll explain this later).


Relax, and Accelerate: A Continuous Perspective on ADMM

arXiv.org Machine Learning

The acceleration technique first introduced by Nesterov for gradient descent is widely used in many machine learning applications, however it is not yet well-understood. Recently, significant progress has been made to close this understanding gap by using a continuous-time dynamical system perspective associated with gradient-based methods. In this paper, we extend this perspective by considering the continuous limit of the family of relaxed Alternating Direction Method of Multipliers (ADMM). We also introduce two new families of relaxed and accelerated ADMM algorithms, one follows Nesterov's acceleration approach and the other is inspired by Polyak's Heavy Ball method, and derive the continuous limit of these families of relaxed and accelerated algorithms as differential equations. Then, using a Lyapunov stability analysis of the dynamical systems, we obtain rate-of-convergence results for convex and strongly convex objective functions.


On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks

arXiv.org Machine Learning

Adaptive stochastic gradient descent methods, such as AdaGrad, Adam, AdaDelta, Nadam, AMSGrad, \textit{etc.}, have been demonstrated efficacious in solving non-convex stochastic optimization, such as training deep neural networks. However, their convergence rates have not been touched under the non-convex stochastic circumstance except recent breakthrough results on AdaGrad \cite{ward2018adagrad} and perturbed AdaGrad \cite{li2018convergence}. In this paper, we propose two new adaptive stochastic gradient methods called AdaHB and AdaNAG which integrate coordinate-wise AdaGrad with heavy ball momentum and Nesterov accelerated gradient momentum, respectively. The $\mathcal{O}(\frac{\log{T}}{\sqrt{T}})$ non-asymptotic convergence rates of AdaHB and AdaNAG in non-convex stochastic setting are also jointly characterized by leveraging a newly developed unified formulation of these two momentum mechanisms. In particular, when momentum term vanishes we obtain convergence rate of coordinate-wise AdaGrad in non-convex stochastic setting as a byproduct.


Ensemble Kalman Inversion: A Derivative-Free Technique For Machine Learning Tasks

arXiv.org Machine Learning

The standard probabilistic perspective on machine learning gives rise to empirical risk-minimization tasks that are frequently solved by stochastic gradient descent (SGD) and variants thereof. We present a formulation of these tasks as classical inverse or filtering problems and, furthermore, we propose an efficient, gradient-free algorithm for finding a solution to these problems using ensemble Kalman inversion (EKI). Applications of our approach include offline and online supervised learning with deep neural networks, as well as graph-based semi-supervised learning. The essence of the EKI procedure is an ensemble based approximate gradient descent in which derivatives are replaced by differences from within the ensemble. We suggest several modifications to the basic method, derived from empirically successful heuristics developed in the context of SGD. Numerical results demonstrate wide applicability and robustness of the proposed algorithm.


Speeding Up Distributed Gradient Descent by Utilizing Non-persistent Stragglers

arXiv.org Machine Learning

Abstract--Distributed gradient descent (DGD) is an efficient way of implementing gradient descent (GD), especially for large data sets, by dividing the computation tasks into smaller sub-tasks and assigning to different computing servers (CSs) to be executed in parallel. In standard parallel execution, per-iteration waiting time is limited by the execution time of the straggling servers. Coded DGD techniques have been introduced recently, which can tolerate straggling servers via assigning redundant computation tasks to the CSs. In most of the existing DGD schemes, either with coded computation or coded communication, the non-straggling CSs transmit one message per iteration once they complete all their assigned computation tasks. However, although the straggling servers cannot complete all their assigned tasks, they are often able to complete a certain portion of them. In this paper, we allow multiple transmissions from each CS at each iteration in order to make sure a maximum number of completed computations can be reported to the aggregating server (AS), including the straggling servers. We numerically show that the average completion time per iteration can be reduced significantly by slightly increasing the communication load per server . Index Terms --Distributed gradient descent, coded computation, coded gradient, polynomial codes, maximum-distance separable codes.