AITopics | Gradient Descent

Gradient Descent

Stochastic Weight Averaging in Parallel: Large-Batch Training that Generalizes Well

Gupta, Vipul, Serrano, Santiago Akle, DeCoste, Dennis

arXiv.org Machine Learning, Jan-7-2020

We propose Stochastic Weight Averaging in Parallel (SWAP), an algorithm to accelerate DNN training. Our algorithm uses large mini-batches to compute an approximate solution quickly and then refines it by averaging the weights of multiple models computed independently and in parallel. The resulting models generalize as well as those trained with small mini-batches but are produced in a substantially shorter time. We demonstrate the reduction in training time and the good generalization performance of the resulting models on the computer vision datasets CIFAR10, CIFAR100, and ImageNet. Stochastic gradient descent (SGD) and its variants are the de facto methods for training deep neural networks (DNNs). Each iteration of SGD computes an estimate of the objective's gradient by sampling a mini-batch of the available training data and computing the gradient of the loss restricted to the sampled data. A popular strategy to accelerate DNN training is to increase the mini-batch size together with the available computational resources. Larger mini-batches produce more precise gradient estimates; these allow for higher learning rates and achieve larger reductions of the training loss per iteration.
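The three phases described in this abstract (a fast large-batch phase, independent small-batch refinement, and weight averaging) can be illustrated on a toy problem. The following is a minimal sketch only, not the authors' implementation: the linear model, the data generator, the learning rates, the batch sizes, and every function name are assumptions chosen for illustration, and the "parallel" workers simply run one after another.

```python
import random


def make_data(n, rng):
    """Hypothetical toy regression data: y = 3*x - 1 plus Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x - 1.0 + rng.gauss(0.0, 0.1)))
    return data


def sgd(weights, data, batch_size, lr, steps, rng):
    """Mini-batch SGD on mean squared error for the model y = w*x + b."""
    w, b = weights
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / batch_size
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def swap_sketch(data, num_workers=4, seed=0):
    """Large-batch phase, independent small-batch refinement, then averaging."""
    rng = random.Random(seed)
    # Phase 1: large mini-batches and a high learning rate reach an
    # approximate solution in few iterations.
    shared = sgd((0.0, 0.0), data, batch_size=128, lr=0.5, steps=30, rng=rng)
    # Phase 2: each worker refines its own copy with small mini-batches.
    # The workers are independent, so this loop could run in parallel.
    workers = [
        sgd(shared, data, batch_size=8, lr=0.05, steps=100,
            rng=random.Random(seed + 1 + k))
        for k in range(num_workers)
    ]
    # Phase 3: average the workers' weights coordinate by coordinate.
    averaged = tuple(sum(ws[i] for ws in workers) / num_workers for i in range(2))
    return shared, workers, averaged


data = make_data(512, random.Random(0))
shared, workers, averaged = swap_sketch(data)
print("after large-batch phase:", shared)
print("averaged weights:", averaged)
```

On a convex toy problem like this one, averaging mainly reduces the noise left by the small-batch phase; the generalization behavior reported in the abstract concerns deep networks and is not something this sketch can demonstrate.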

batch size, generalization performance, swap, (13 more...)

arXiv.org Machine Learning

2001.02312

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Virginia (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Accelerating Smooth Games by Manipulating Spectral Shapes

Azizian, Waïss, Scieur, Damien, Mitliagkas, Ioannis, Lacoste-Julien, Simon, Gidel, Gauthier

arXiv.org Machine Learning, Jan-2-2020

We use matrix iteration theory to characterize acceleration in smooth games. We define the spectral shape of a family of games as the set containing all eigenvalues of the Jacobians of standard gradient dynamics in the family. Shapes restricted to the real line represent well-understood classes of problems, like minimization. Shapes spanning the complex plane capture the added numerical challenges in solving smooth games. In this framework, we describe gradient-based methods, such as extragradient, as transformations on the spectral shape. Using this perspective, we propose an optimal algorithm for bilinear games. For smooth and strongly monotone operators, we identify a continuum between convex minimization, where acceleration is possible using Polyak's momentum, and the worst case where gradient descent is optimal. Finally, going beyond first-order methods, we propose an accelerated version of consensus optimization.
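As a concrete instance of a game whose spectral shape leaves the real line, consider the bilinear game min over x, max over y of f(x, y) = x*y. Its gradient vector field (y, -x) has a Jacobian with eigenvalues +i and -i. The sketch below compares plain simultaneous gradient descent-ascent with the standard extragradient update on this game. It is an illustration under assumed settings (step size 0.5, start point (1, 1), hypothetical function names), not the optimal algorithm proposed in the paper.

```python
import math


def gda_step(x, y, lr):
    """Simultaneous gradient descent (on x) and ascent (on y) for f = x*y."""
    # df/dx = y and df/dy = x.
    return x - lr * y, y + lr * x


def extragradient_step(x, y, lr):
    """Extragradient: take a look-ahead step, then update using its gradient."""
    x_half, y_half = x - lr * y, y + lr * x
    # The real update starts from (x, y) but uses the look-ahead gradients.
    return x - lr * y_half, y + lr * x_half


def run(step, steps=100, lr=0.5, start=(1.0, 1.0)):
    """Apply `step` repeatedly and return the distance to the solution (0, 0)."""
    x, y = start
    for _ in range(steps):
        x, y = step(x, y, lr)
    return math.hypot(x, y)


gda_distance = run(gda_step)
eg_distance = run(extragradient_step)
print("gradient descent-ascent distance to (0, 0):", gda_distance)
print("extragradient distance to (0, 0):", eg_distance)
```

For this game each descent-ascent step multiplies the squared distance to the solution by 1 + lr^2, so the iterates spiral outward, while each extragradient step multiplies it by 1 - lr^2 + lr^4, which is below 1 for any step size strictly between 0 and 1.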

accelerating smooth game, convergence, eigenvalue, (14 more...)

arXiv.org Machine Learning

2001.00602

Country:

North America > United States > New York (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Why Expectation and Maximization algorithm not used in Machine Learning while Gradient Descent algorithm used in Machine Learning?

#artificialintelligence, Dec-31-2019, 14:50:37 GMT

I know that Newton-Raphson, Expectation-Maximization, and gradient descent are all optimization methods. I wonder why gradient descent is chosen for most machine learning applications, while I have never heard of Expectation-Maximization or Newton-Raphson being applied.

expectation & maximization, gradient descent algorithm, machine learning

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

No Spurious Local Minima in Deep Quadratic Networks

Kazemipour, Abbas, Larsen, Brett, Druckmann, Shaul

arXiv.org Machine Learning, Dec-31-2019

Despite the practical success of neural networks, a theoretical understanding of their loss landscape has proven challenging due to the high-dimensional, non-convex, and highly nonlinear structure of such models. In this paper, we characterize the landscape of the quadratic loss for neural networks with quadratic activation functions. We prove existence of spurious local minima and saddle points which can be escaped easily with probability one when the number of neurons is greater than or equal to the input dimension and the norm of the training samples is used as a regressor. We prove that deep overparameterized neural networks with quadratic activations benefit from similar nice landscape properties. Our theoretical results are independent of data distribution and fill the existing gap in theory for two-layer quadratic neural networks. Finally, we empirically demonstrate convergence to a global minimum for these problems.

global minimum, neural network, stationary point, (13 more...)

arXiv.org Machine Learning

2001.00098

Country: North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.30)

Stochastic Recursive Variance Reduction for Efficient Smooth Non-Convex Compositional Optimization

Yuan, Huizhuo, Lian, Xiangru, Liu, Ji

arXiv.org Machine Learning, Dec-31-2019

Stochastic compositional optimization arises in many important machine learning tasks such as value function evaluation in reinforcement learning and portfolio management. The objective function is the composition of two expectations of stochastic functions, and is more challenging to optimize than vanilla stochastic optimization problems. In this paper, we investigate the stochastic compositional optimization in the general smooth non-convex setting. We employ a recently developed idea of Stochastic Recursive Gradient Descent to design a novel algorithm named SARAH-Compositional, and prove a sharp Incremental First-order Oracle (IFO) complexity upper bound for stochastic compositional optimization: $\mathcal{O}((n+m)^{1/2} \varepsilon^{-2})$ in the finite-sum case and $\mathcal{O}(\varepsilon^{-3})$ in the online case. Such a complexity is known to be the best one among IFO complexity results for non-convex stochastic compositional optimization, and is believed to be optimal. Our experiments validate the theoretical performance of our algorithm.

algorithm, ifo complexity, optimization, (16 more...)

arXiv.org Machine Learning

1912.13515

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Variance Reduced Local SGD with Lower Communication Complexity

Liang, Xianfeng, Shen, Shuheng, Liu, Jingchang, Pan, Zhen, Chen, Enhong, Cheng, Yifei

arXiv.org Machine Learning, Dec-30-2019

To accelerate the training of machine learning models, distributed stochastic gradient descent (SGD) and its variants have been widely adopted, which apply multiple workers in parallel to speed up training. Among them, Local SGD has gained much attention due to its lower communication cost. Nevertheless, when the data distribution on workers is non-identical, Local SGD requires $O(T^{\frac{3}{4}} N^{\frac{3}{4}})$ communications to maintain its linear iteration speedup property, where $T$ is the total number of iterations and $N$ is the number of workers. In this paper, we propose Variance Reduced Local SGD (VRL-SGD) to further reduce the communication complexity. Benefiting from eliminating the dependency on the gradient variance among workers, we theoretically prove that VRL-SGD achieves a linear iteration speedup with a lower communication complexity $O(T^{\frac{1}{2}} N^{\frac{3}{2}})$ even if workers access non-identical datasets. We conduct experiments on three machine learning tasks, and the experimental results demonstrate that VRL-SGD performs impressively better than Local SGD when the data among workers are quite diverse.
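For orientation, the baseline that this abstract builds on, plain Local SGD, can be sketched as follows: every worker takes several local SGD steps on its own data, and the models are averaged once per communication round. This is a sketch of the baseline only, not of VRL-SGD. The per-worker objectives f_i(w) = 0.5 * (w - c_i)^2, the noise level, and all names and default values are assumptions for illustration.

```python
import random


def local_sgd(centers, rounds=40, local_steps=5, lr=0.1, noise=0.1, seed=0):
    """Plain Local SGD on f_i(w) = 0.5 * (w - c_i)**2, one c_i per worker."""
    rng = random.Random(seed)
    w_global = 0.0
    communications = 0
    for _ in range(rounds):
        local_models = []
        for c in centers:
            # Each worker starts the round from the shared model.
            w = w_global
            for _ in range(local_steps):
                # Stochastic gradient: exact gradient plus Gaussian noise.
                grad = (w - c) + rng.gauss(0.0, noise)
                w -= lr * grad
            local_models.append(w)
        # One communication per round: average the local models.
        w_global = sum(local_models) / len(local_models)
        communications += 1
    return w_global, communications


# Non-identical worker objectives; the global minimizer is the mean of the
# centers, which is 1.0 for this hypothetical choice.
centers = [-2.0, 0.0, 1.0, 5.0]
w_final, communications = local_sgd(centers)
print("final model:", w_final)
print("communication rounds:", communications)
```

Here 200 gradient steps per worker cost only 40 communications. Because every worker in this toy has the same curvature, the averaged model still reaches the global minimizer, so the sketch does not reproduce the heterogeneity effects that the abstract analyzes.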

local sgd, variance, vrl-sgd, (11 more...)

arXiv.org Machine Learning

1912.12844

Country:

Asia > China > Hong Kong (0.04)
Asia > China > Anhui Province (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Federated Variance-Reduced Stochastic Gradient Descent with Robustness to Byzantine Attacks

Wu, Zhaoxian, Ling, Qing, Chen, Tianyi, Giannakis, Georgios B.

arXiv.org Machine Learning, Dec-29-2019

This paper deals with distributed finite-sum optimization for learning over networks in the presence of malicious Byzantine attacks. To cope with such attacks, most resilient approaches so far combine stochastic gradient descent (SGD) with different robust aggregation rules. However, the sizeable SGD-induced stochastic gradient noise makes it challenging to distinguish malicious messages sent by the Byzantine attackers from noisy stochastic gradients sent by the 'honest' workers. This motivates us to reduce the variance of stochastic gradients as a means of robustifying SGD in the presence of Byzantine attacks. To this end, the present work puts forth a Byzantine attack resilient distributed SAGA (Byrd-SAGA) approach for learning tasks involving finite-sum optimization over networks. Rather than the mean employed by distributed SAGA, the novel Byrd-SAGA relies on the geometric median to aggregate the corrected stochastic gradients sent by the workers. When fewer than half of the workers are Byzantine attackers, the robustness of the geometric median to outliers enables Byrd-SAGA to attain provably linear convergence to a neighborhood of the optimal solution, with the asymptotic learning error determined by the number of Byzantine workers. Numerical tests corroborate the robustness to various Byzantine attacks, as well as the merits of Byrd-SAGA over Byzantine attack resilient distributed SGD.
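The contrast between mean and geometric-median aggregation can be seen on a handful of made-up gradient messages. The sketch below approximates the geometric median with Weiszfeld's fixed-point iteration, which is one common way to compute it; the abstract does not say which solver the authors use. The message values, function names, and iteration count are assumptions, and the SAGA gradient correction itself is not implemented.

```python
import math


def mean_point(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def geometric_median(points, iterations=200, eps=1e-12):
    """Approximate the geometric median with Weiszfeld's fixed-point iteration."""
    estimate = mean_point(points)
    dim = len(estimate)
    for _ in range(iterations):
        # Weight each point by the inverse of its distance to the estimate;
        # eps guards against division by zero when they coincide.
        weights = [1.0 / max(math.dist(estimate, p), eps) for p in points]
        total = sum(weights)
        estimate = tuple(
            sum(w * p[i] for w, p in zip(weights, points)) / total
            for i in range(dim)
        )
    return estimate


# Hypothetical messages: five honest workers near (1, 1), two Byzantine ones.
honest = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.05, 1.0), (0.95, 1.0)]
byzantine = [(100.0, -100.0), (100.0, -100.0)]
messages = honest + byzantine

print("mean aggregate:", mean_point(messages))
print("geometric median aggregate:", geometric_median(messages))
```

With two of seven messages corrupted, the mean is dragged far from the honest cluster while the geometric median stays close to it, which is the robustness property the abstract relies on when fewer than half of the workers are Byzantine.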

byzantine attack, gradient, stochastic gradient, (12 more...)

arXiv.org Machine Learning

1912.12716

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
North America > United States > Nevada (0.04)
(10 more...)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Gradient descent for linear regression using Golang - Backlog

#artificialintelligence, Dec-28-2019, 03:37:52 GMT

I recently decided to dive into machine learning, a field I have wanted to understand for a long time but have never had the time to pursue. I've been taking the free (and amazing!) course from Stanford University's Andrew Ng on Coursera. The first two weeks are dedicated to the Linear Gradient algorithm. In this post, I'll provide an overview of how it works and share how I implemented the vectorized version and parts of the non-vectorized version in Golang using the gonum library. Linear regression is a technique used in modeling the linear relationship between an input and its output.
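The linked post implements this in Golang with the gonum library. For readers who only want the shape of the algorithm, here is a minimal non-vectorized sketch in plain Python. It is not the author's code: the cost function J = (1/2m) * sum of squared errors follows the usual convention of the course mentioned above, and the data, learning rate, iteration count, and names are assumptions.

```python
def cost(theta0, theta1, xs, ys):
    """Half the mean squared error of the hypothesis h(x) = theta0 + theta1*x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, lr=0.05, iterations=5000):
    """Batch gradient descent for single-variable linear regression."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to each parameter.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously.
        theta0, theta1 = theta0 - lr * grad0, theta1 - lr * grad1
    return theta0, theta1


# Hypothetical data lying exactly on the line y = 1 + 2*x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
print("intercept:", theta0, "slope:", theta1)
print("final cost:", cost(theta0, theta1, xs, ys))
```

A vectorized version would replace the two sums with a single matrix-vector product over a design matrix that has a leading column of ones.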

algorithm, cost function, hypothesis, (13 more...)

#artificialintelligence

Country: Europe > Netherlands > North Holland > Amsterdam (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Understanding Gradient Descent And Its Variants

#artificialintelligence, Dec-26-2019, 10:34:28 GMT

Machine learning models are fantastic: they can recognize objects in videos, automatically generate captions for images, and accurately classify pictures of cats and dogs (sometimes). This article provides a surface-level understanding of what happens under the hood of machine learning models. More specifically, we will explore the 'backbone algorithms' that enable these models to learn. These 'backbone algorithms' are called optimization algorithms. Below are definitions of some keywords you will encounter in this article; 'optimization algorithm' is among them.

algorithm, gradient descent, parameter value, (13 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Spurious Local Minima of Shallow ReLU Networks Conform with the Symmetry of the Target Model

Arjevani, Yossi, Field, Michael

arXiv.org Machine Learning, Dec-26-2019

We consider the optimization problem associated with fitting two-layer ReLU networks with respect to the squared loss, where labels are assumed to be generated by a target network. Focusing first on standard Gaussian inputs, we show that the structure of spurious local minima detected by stochastic gradient descent (SGD) is, in a well-defined sense, the least loss of symmetry with respect to the target weights. A closer look at the analysis then indicates that this principle of least symmetry breaking may apply to a broader range of settings. Motivated by this, we conduct a series of experiments which corroborate this hypothesis for different classes of non-isotropic non-product distributions, smooth activation functions and networks with a few layers.

critical point, matrix, subgroup, (15 more...)

arXiv.org Machine Learning

1912.11939

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
(4 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
