Goto

Collaborating Authors

 Gradient Descent


An overview of gradient descent optimization algorithms

@machinelearnbot

Note: If you are looking for a review paper, this blog post is also available as an article on arXiv. Added derivations of AdaMax and Nadam. Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This blog post aims at providing you with intuitions towards the behaviour of different algorithms for optimizing gradient descent that will help you put them to use. We are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training. Subsequently, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules.


Stochastic Gradient Descent in Continuous Time

arXiv.org Machine Learning

Stochastic gradient descent in continuous time (SGDCT) provides a computationally efficient method for the statistical learning of continuous-time models, which are widely used in science, engineering, and finance. The SGDCT algorithm follows a (noisy) descent direction along a continuous stream of data. SGDCT performs an online parameter update in continuous time, with the parameter updates $\theta_t$ satisfying a stochastic differential equation. We prove that $\lim_{t \rightarrow \infty} \nabla \bar g(\theta_t) = 0$ where $\bar g$ is a natural objective function for the estimation of the continuous-time dynamics. The convergence proof leverages ergodicity by using an appropriate Poisson equation to help describe the evolution of the parameters for large times. SGDCT can also be used to solve continuous-time optimization problems, such as American options. For certain continuous-time problems, SGDCT has some promising advantages compared to a traditional stochastic gradient descent algorithm. As an example application, SGDCT is combined with a deep neural network to price high-dimensional American options (up to 100 dimensions).


Automated Design using Neural Networks and Gradient Descent

arXiv.org Machine Learning

We propose a novel method that makes use of deep neural networks and gradient decent to perform automated design on complex real world engineering tasks. Our approach works by training a neural network to mimic the fitness function of a design optimization task and then, using the differential nature of the neural network, perform gradient decent to maximize the fitness. We demonstrate this methods effectiveness by designing an optimized heat sink and both 2D and 3D airfoils that maximize the lift drag ratio under steady state flow conditions. We highlight that our method has two distinct benefits over other automated design approaches. First, evaluating the neural networks prediction of fitness can be orders of magnitude faster then simulating the system of interest. Second, using gradient decent allows the design space to be searched much more efficiently then other gradient free methods. These two strengths work together to overcome some of the current shortcomings of automated design.


SGDLibrary: A MATLAB library for stochastic gradient descent algorithms

arXiv.org Machine Learning

We consider the problem of finding the minimizer of a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ of the form $\min f(w) = \frac{1}{n}\sum_{i}f_i({w})$. This problem has been studied intensively in recent years in machine learning research field. One typical but promising approach for large-scale data is stochastic optimization algorithm. SGDLibrary is a flexible, extensible and efficient pure-Matlab library of a collection of stochastic optimization algorithms. The purpose of the library is to provide researchers and implementers a comprehensive evaluation environment of those algorithms on various machine learning problems.


Gradient Sparsification for Communication-Efficient Distributed Optimization

arXiv.org Machine Learning

Modern large scale machine learning applications require scaling stochastic optimization algorithms to distributed computational architectures. A key bottleneck is the communication overhead for exchanging information among different workers. For example, we have n training data distributed on M workers, and each of them owns its local copy of the model parameter vector. In the synchronized stochastic gradient method, each worker processes a random minibatch of its training data, and then the local updates are synchronized by making an All-Reduce step, which aggregates stochastic gradients from all workers, and taking a Broadcast step that transmits the updated parameter vector back to all workers. The process is repeated until an appropriate convergence criterion is met. An important factor that may significantly slow down any optimization algorithm is the communication cost among workers.


Duality-free Methods for Stochastic Composition Optimization

arXiv.org Machine Learning

We consider the composition optimization with two expected-value functions in the form of $\frac{1}{n}\sum\nolimits_{i = 1}^n F_i(\frac{1}{m}\sum\nolimits_{j = 1}^m G_j(x))+R(x)$, { which formulates many important problems in statistical learning and machine learning such as solving Bellman equations in reinforcement learning and nonlinear embedding}. Full Gradient or classical stochastic gradient descent based optimization algorithms are unsuitable or computationally expensive to solve this problem due to the inner expectation $\frac{1}{m}\sum\nolimits_{j = 1}^m G_j(x)$. We propose a duality-free based stochastic composition method that combines variance reduction methods to address the stochastic composition problem. We apply SVRG and SAGA based methods to estimate the inner function, and duality-free method to estimate the outer function. We prove the linear convergence rate not only for the convex composition problem, but also for the case that the individual outer functions are non-convex while the objective function is strongly-convex. We also provide the results of experiments that show the effectiveness of our proposed methods.


Neural Network Foundations, Explained: Updating Weights with Gradient Descent & Backpropagation

@machinelearnbot

Recall that in order for a neural networks to learn, weights associated with neuron connections must be updated after forward passes of data through the network. These weights are adjusted to help reconcile the differences between the actual and predicted outcomes for subsequent forward passes. But how, exactly, do the weights get adjusted? Before we get to the actual adjustments, think of what would be needed at each neuron in order to make a meaningful change to a given weight. Since we are talking about the difference between actual and predicted values, the error would be a useful measure here, and so each neuron will require that their respective error be sent backward through the network to them in order to facilitate the update process; hence, backpropagation of error.


A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)

arXiv.org Machine Learning

This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares. This result is obtained by analyzing SGD as a stochastic process and by sharply characterizing the stationary covariance matrix of this process. The finite rate optimality characterization captures the constant factors and addresses model mis-specification.


Stability and Generalization of Learning Algorithms that Converge to Global Optima

arXiv.org Machine Learning

We establish novel generalization bounds for learning algorithms that converge to global minima. We do so by deriving black-box stability results that only depend on the convergence of a learning algorithm and the geometry around the minimizers of the loss function. The results are shown for nonconvex loss functions satisfying the Polyak-{\L}ojasiewicz (PL) and the quadratic growth (QG) conditions. We further show that these conditions arise for some neural networks with linear activations. We use our black-box results to establish the stability of optimization algorithms such as stochastic gradient descent (SGD), gradient descent (GD), randomized coordinate descent (RCD), and the stochastic variance reduced gradient method (SVRG), in both the PL and the strongly convex setting. Our results match or improve state-of-the-art generalization bounds and can easily be extended to similar optimization algorithms. Finally, we show that although our results imply comparable stability for SGD and GD in the PL setting, there exist simple neural networks with multiple local minima where SGD is stable but GD is not.


Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent

arXiv.org Machine Learning

We consider the problem of distributed statistical machine learning in adversarial settings, where some unknown and time-varying subset of working machines may be compromised and behave arbitrarily to prevent an accurate model from being learned. This setting captures the potential adversarial attacks faced by Federated Learning -- a modern machine learning paradigm that is proposed by Google researchers and has been intensively studied for ensuring user privacy. Formally, we focus on a distributed system consisting of a parameter server and $m$ working machines. Each working machine keeps $N/m$ data samples, where $N$ is the total number of samples. The goal is to collectively learn the underlying true model parameter of dimension $d$. In classical batch gradient descent methods, the gradients reported to the server by the working machines are aggregated via simple averaging, which is vulnerable to a single Byzantine failure. In this paper, we propose a Byzantine gradient descent method based on the geometric median of means of the gradients. We show that our method can tolerate $q \le (m-1)/2$ Byzantine failures, and the parameter estimate converges in $O(\log N)$ rounds with an estimation error of $\sqrt{d(2q+1)/N}$, hence approaching the optimal error rate $\sqrt{d/N}$ in the centralized and failure-free setting. The total computational complexity of our algorithm is of $O((Nd/m) \log N)$ at each working machine and $O(md + kd \log^3 N)$ at the central server, and the total communication cost is of $O(m d \log N)$. We further provide an application of our general results to the linear regression problem. A key challenge arises in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. We prove that the aggregated gradient converges uniformly to the true gradient function.