AITopics

We consider an existing conjecture addressing the asymptotic behavior of neural networks in the large width limit. The results that follow from this conjecture include tight bounds on the behavior of wide networks during stochastic gradient descent, and a derivation of their finite-width dynamics. We prove the conjecture for deep networks with polynomial activation functions, greatly extending the validity of these results. Finally, we point out a difference in the asymptotic behavior of networks with analytic (and non-linear) activation functions and those with piecewise-linear activations such as ReLU.

artificial intelligence, correlation function, machine learning, (16 more...)

2006.06687

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Hosseini, Mahdi S., Plataniotis, Konstantinos N.

AdaS: Adaptive Scheduling of Stochastic Gradients

The choice of step-size used in Stochastic Gradient Descent (SGD) optimization is empirically selected in most training procedures. Moreover, the use of scheduled learning techniques such as Step-Decaying, Cyclical-Learning, and Warmup to tune the step-size requires extensive practical experience--offering limited insight into how the parameters update--and is not consistent across applications. This work attempts to answer a question of interest to both researchers and practitioners, namely \textit{"how much knowledge is gained in iterative training of deep neural networks?"} Answering this question introduces two useful metrics derived from the singular values of the low-rank factorization of convolution layers in deep neural networks. We introduce the notions of \textit{"knowledge gain"} and \textit{"mapping condition"} and propose a new algorithm called Adaptive Scheduling (AdaS) that utilizes these derived metrics to adapt the SGD learning rate proportionally to the rate of change in knowledge gain over successive iterations. Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training. Code is available at \url{https://github.com/mahdihosseini/AdaS}.

artificial intelligence, knowledge gain, machine learning, (18 more...)

2006.06587

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

STL-SGD: Speeding Up Local SGD with Stagewise Communication Period

Shen, Shuheng, Cheng, Yifei, Liu, Jingchang, Xu, Linli

Distributed parallel stochastic gradient descent algorithms are workhorses for large scale machine learning tasks. Among them, local stochastic gradient descent (Local SGD) has attracted significant attention due to its low communication complexity. Previous studies prove that the communication complexity of Local SGD with a fixed or an adaptive communication period is in the order of $O (N^{\frac{3}{2}} T^{\frac{1}{2}})$ and $O (N^{\frac{3}{4}} T^{\frac{3}{4}})$ when the data distributions on clients are identical (IID) or otherwise (Non-IID). In this paper, to accelerate the convergence by reducing the communication complexity, we propose \textit{ST}agewise \textit{L}ocal \textit{SGD} (STL-SGD), which increases the communication period gradually along with decreasing learning rate. We prove that STL-SGD can keep the same convergence rate and linear speedup as mini-batch SGD. In addition, as the benefit of increasing the communication period, when the objective is strongly convex or satisfies the Polyak-\L ojasiewicz condition, the communication complexity of STL-SGD is $O (N \log{T})$ and $O (N^{\frac{1}{2}} T^{\frac{1}{2}})$ for the IID case and the Non-IID case respectively, achieving significant improvements over Local SGD. Experiments on both convex and non-convex problems demonstrate the superior performance of STL-SGD.

artificial intelligence, local sgd, machine learning, (15 more...)

2006.06377

Country:

Asia > China > Hong Kong (0.04)
Asia > China > Anhui Province (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.95)

Hodgkinson, Liam, Mahoney, Michael W.

Multiplicative noise and heavy tails in stochastic optimization

Stochastic optimization is the process of minimizing a deterministic objective function via the simulation of random elements, and it is one of the most successful methods for optimizing complex or unknown objectives. Relatively simple stochastic optimization procedures--in particular, stochastic gradient descent (SGD)--have become the backbone of modern machine learning (ML) [50]. To improve understanding of stochastic optimization in ML, and particularly why SGD and its extensions work so well, recent theoretical work has sought to study its properties and dynamics [47]. Such analyses typically approach the problem through one of two perspectives. The first perspective, an optimization (or quenching) perspective, examines convergence either in expectation [11, 20, 28, 60, 84] or with some positive (high) probability [19, 41, 66, 77] through the lens of a deterministic counterpart. This perspective inherits some limitations of deterministic optimizers, including assumptions (e.g., convexity, Polyak-Łojasiewicz criterion, etc.) that are either not satisfied by state-of-the-art problems, or not strong enough to imply convergence to a quality (e.g., global) optimum. More concerning, however, is the inability to explain what has come to be known as the "generalization gap" phenomenon: increasing stochasticity by reducing batch size appears to improve generalization performance [38, 55]. Empirically, existing strategies do tend to break down for inference tasks when using large batch sizes [27].

artificial intelligence, machine learning, stationary distribution, (15 more...)

2006.06293

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Government > Military (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)

#artificialintelligenceJun-10-2020, 16:01:32 GMT

A Visual Explanation of Gradient Descent Methods (Momentum, AdaGrad, RMSProp, Adam)

With a myriad of resources out there explaining gradient descents, in this post, I'd like to visually walk you through how each of these methods works. With the aid of a gradient descent visualization tool I built, hopefully I can present you with some unique insights, or minimally, many GIFs. I assume basic familiarity of why and how gradient descent is used in machine learning (if not, I recommend this video by 3Blue1Brown) . My focus here is to compare and contrast these methods. If you are already familiar with all the methods, you can scroll to the bottom to watch a few fun "horse races".

artificial intelligence, gradient descent, machine learning, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

arXiv.org Machine LearningJun-10-2020

Borrowing From the Future: Addressing Double Sampling in Model-free Control

Zhu, Yuhua, Izzo, Zach, Ying, Lexing

In model-free reinforcement learning, the temporal difference method and its variants become unstable when combined with nonlinear function approximations. Bellman residual minimization with stochastic gradient descent (SGD) is more stable, but it suffers from the double sampling problem: given the current state, two independent samples for the next state are required, but often only one sample is available. Recently, the authors of Zhu et al. [2020] introduced the borrowing from the future (BFF) algorithm to address this issue for the prediction problem. The main idea is to borrow extra randomness from the future to approximately re-sample the next state when the underlying dynamics of the problem are sufficiently smooth. This paper extends the BFF algorithm to action-value function based model-free control. We prove that BFF is close to unbiased SGD when the underlying dynamics vary slowly with respect to actions. We confirm our theoretical findings with numerical simulations.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

2006.06173

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Majka, Mateusz B., Sabate-Vidales, Marc, Szpruch, Łukasz

Multi-index Antithetic Stochastic Gradient Algorithm

arXiv.org Machine LearningJun-10-2020

Stochastic Gradient Algorithms (SGAs) are ubiquitous in computational statistics, machine learning and optimisation. Recent years have brought an influx of interest in SGAs and the non-asymptotic analysis of their bias is by now well-developed. However, in order to fully understand the efficiency of Monte Carlo algorithms utilizing stochastic gradients, one also needs to carry out the analysis of their variance, which turns out to be problem-specific. For this reason, there is no systematic theory that would specify the optimal choice of the random approximation of the gradient in SGAs for a given data regime. Furthermore, while there have been numerous attempts to reduce the variance of SGAs, these typically exploit a particular structure of the sampled distribution. In this paper we use the Multi-index Monte Carlo apparatus combined with the antithetic approach to construct the Multi-index Antithetic Stochastic Gradient Algorithm (MASGA), which can be used to sample from any probability distribution. This, to our knowledge, is the first SGA that, for all data regimes and without relying on any specific structure of the target measure, achieves performance on par with Monte Carlo estimators that have access to unbiased samples from the distribution of interest. In other words, MASGA is an optimal estimator from the error-computational cost perspective within the class of Monte Carlo estimators.

artificial intelligence, estimator, machine learning, (17 more...)

2006.06102

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Mignacco, Francesca, Krzakala, Florent, Urbani, Pierfrancesco, Zdeborová, Lenka

Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification

arXiv.org Machine LearningJun-10-2020

We analyze in a closed form the learning dynamics of stochastic gradient descent (SGD) for a single layer neural network classifying a high-dimensional Gaussian mixture where each cluster is assigned one of two labels. This problem provides a prototype of a non-convex loss landscape with interpolating regimes and a large generalization gap. We define a particular stochastic process for which SGD can be extended to a continuous-time limit that we call stochastic gradient flow. In the full-batch limit we recover the standard gradient flow. We apply dynamical mean-field theory from statistical physics to track the dynamics of the algorithm in the high-dimensional limit via a self-consistent stochastic process. We explore the performance of the algorithm as a function of control parameters shedding light on how it navigates the loss landscape.

artificial intelligence, gradient descent, machine learning, (17 more...)

2006.06098

Country:

North America > United States (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Franzese, Giulio, Candela, Rosa, Milios, Dimitrios, Filippone, Maurizio, Michiardi, Pietro

Isotropic SGD: a Practical Approach to Bayesian Posterior Sampling

arXiv.org Machine LearningJun-9-2020

In this work we define a unified mathematical framework to deepen our understanding of the role of stochastic gradient (SG) noise on the behavior of Markov chain Monte Carlo sampling (SGMCMC) algorithms. Our formulation unlocks the design of a novel, practical approach to posterior sampling, which makes the SG noise isotropic using a fixed learning rate that we determine analytically, and that requires weaker assumptions than existing algorithms. In contrast, the common traits of existing \sgmcmc algorithms is to approximate the isotropy condition either by drowning the gradients in additive noise (annealing the learning rate) or by making restrictive assumptions on the \sg noise covariance and the geometry of the loss landscape. Extensive experimental validations indicate that our proposal is competitive with the state-of-the-art on \sgmcmc, while being much more practical to use.

neural network, posterior, upstream oil & gas, (20 more...)

2006.05087

Country: Europe > Italy (0.14)

Genre: Research Report (0.64)

Industry: Energy > Oil & Gas > Upstream (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

Bensoussan, Alain, Li, Yiqun, Nguyen, Dinh Phan Cao, Tran, Minh-Binh, Yam, Sheung Chi Phillip, Zhou, Xiang

Machine Learning and Control Theory

arXiv.org Machine LearningJun-9-2020

We survey in this article the connections between Machine Learning and Control Theory. Control Theory provide useful concepts and tools for Machine Learning. Conversely Machine Learning can be used to solve large control problems. In the first part of the paper, we develop the connections between reinforcement learning and Markov Decision Processes, which are discrete time control problems. In the second part, we review the concept of supervised learning and the relation with static optimization. Deep learning which extends supervised learning, can be viewed as a control problem. In the third part, we present the links between stochastic gradient descent and mean-field theory. Conversely, in the fourth and fifth parts, we review machine learning approaches to stochastic control problems, and focus on the deterministic case, to explain, more easily, the numerical algorithms.

artificial intelligence, bn 1, machine learning, (13 more...)

2006.05604

Country:

Asia > China > Hong Kong (0.05)
North America > United States > Texas > Dallas County > Dallas (0.04)
Asia > Vietnam > Khánh Hòa Province > Nha Trang (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)