AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

An Empirical Analysis of Measure-Valued Derivatives for Policy Gradients

Carvalho, João, Tateo, Davide, Muratore, Fabio, Peters, Jan

arXiv.org Artificial IntelligenceJul-20-2021

Reinforcement learning methods for robotics are increasingly successful due to the constant development of better policy gradient techniques. A precise (low variance) and accurate (low bias) gradient estimator is crucial to face increasingly complex tasks. Traditional policy gradient algorithms use the likelihood-ratio trick, which is known to produce unbiased but high variance estimates. More modern approaches exploit the reparametrization trick, which gives lower variance gradient estimates but requires differentiable value function approximators. In this work, we study a different type of stochastic gradient estimator: the Measure-Valued Derivative. This estimator is unbiased, has low variance, and can be used with differentiable and non-differentiable function approximators. We empirically evaluate this estimator in the actor-critic policy gradient setting and show that it can reach comparable performance with methods based on the likelihood-ratio or reparametrization tricks, both in low and high-dimensional action spaces.

artificial intelligence, machine learning, reinforcement learning, (20 more...)

arXiv.org Artificial Intelligence

2107.09359

Country:

Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.05)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
Asia > Middle East > Jordan (0.04)
(4 more...)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Kernel Selection for Stein Variational Gradient Descent

Ai, Qingzhong, Liu, Shiyu, Xu, Zenglin

arXiv.org Machine LearningJul-20-2021

Stein variational gradient descent (SVGD) and its variants have shown promising successes in approximate inference for complex distributions. However, their empirical performance depends crucially on the choice of optimal kernel. Unfortunately, RBF kernel with median heuristics is a common choice in previous approaches which has been proved sub-optimal. Inspired by the paradigm of multiple kernel learning, our solution to this issue is using a combination of multiple kernels to approximate the optimal kernel instead of a single one which may limit the performance and flexibility. To do so, we extend Kernelized Stein Discrepancy (KSD) to its multiple kernel view called Multiple Kernelized Stein Discrepancy (MKSD). Further, we leverage MKSD to construct a general algorithm based on SVGD, which be called Multiple Kernel SVGD (MK-SVGD). Besides, we automatically assign a weight to each kernel without any other parameters. The proposed method not only gets rid of optimal kernel dependence but also maintains computational effectiveness. Experiments on various tasks and models show the effectiveness of our method.

kernel, svgd, target distribution, (11 more...)

arXiv.org Machine Learning

2107.09338

Country:

Asia > China > Guangdong Province > Shenzhen (0.05)
Asia > Middle East > Jordan (0.04)
Asia > China > Sichuan Province > Chengdu (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.72)

Add feedback

Proxy Convexity: A Unified Framework for the Analysis of Neural Networks Trained by Gradient Descent

Frei, Spencer, Gu, Quanquan

arXiv.org Machine LearningJul-20-2021

Understanding the ability of gradient-based stochastic optimization algorithms to find good minima of non-convex objective functions has become an especially important problem due to the success of stochastic gradient descent (SGD) in learning deep neural networks. Although there exist non-convex objective functions and domains for which SGD will necessarily lead to sub-optimal local minima, it appears that for many problems of interest in deep learning, across domains as varied as natural language and images, these worst-case situations do not arise. Indeed, a number of recent works have developed provable guarantees for GD and SGD when used for objective functions defined in terms of neural networks over certain distributions, despite the non-convexity of the underlying optimization problem (Brutzkus et al., 2018; Allen-Zhu et al., 2019; Cao and Gu, 2020; Ji and Telgarsky, 2020; Frei et al., 2020, 2021b). To date, however, there has not been a framework which could unify the variegated approaches for guarantees in these settings. In this work, we introduce the notion of proxy convexity and demonstrate that many existing provable guarantees for learning with neural networks trained by gradient-based optimization fall into a problem satisfying proxy convexity.

inequality, neural network, pl inequality, (13 more...)

arXiv.org Machine Learning

2106.13792

Country: North America > United States > California > Los Angeles County > Los Angeles (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Edge of chaos as a guiding principle for modern neural network training

Zhang, Lin, Feng, Ling, Chen, Kan, Lai, Choy Heng

arXiv.org Artificial IntelligenceJul-20-2021

The success of deep neural networks in real-world problems has prompted many attempts to explain their training dynamics and generalization performance, but more guiding principles for the training of neural networks are still needed. Motivated by the edge of chaos principle behind the optimal performance of neural networks, we study the role of various hyperparameters in modern neural network training algorithms in terms of the order-chaos phase diagram. In particular, we study a fully analytical feedforward neural network trained on the widely adopted Fashion-MNIST dataset, and study the dynamics associated with the hyperparameters in back-propagation during the training process. We find that for the basic algorithm of stochastic gradient descent with momentum, in the range around the commonly used hyperparameter values, clear scaling relations are present with respect to the training time during the ordered phase in the phase diagram, and the model's optimal generalization power at the edge of chaos is similar across different training parameter combinations. In the chaotic phase, the same scaling no longer exists. The scaling allows us to choose the training parameters to achieve faster training without sacrificing performance. In addition, we find that the commonly used model regularization method - weight decay - effectively pushes the model towards the ordered phase to achieve better performance. Leveraging on this fact and the scaling relations in the other hyperparameters, we derived a principled guideline for hyperparameter determination, such that the model can achieve optimal performance by saturating it at the edge of chaos. Demonstrated on this simple neural network model and training algorithm, our work improves the understanding of neural network training dynamics, and can potentially be extended to guiding principles of more complex model architectures and algorithms.

chaos, neural network, training parameter, (14 more...)

arXiv.org Artificial Intelligence

2107.09437

Country: Asia > Singapore > Central Region > Singapore (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)

Add feedback

Asymptotic Escape of Spurious Critical Points on the Low-rank Matrix Manifold

Hou, Thomas Y., Li, Zhenzhen, Zhang, Ziyun

arXiv.org Machine LearningJul-19-2021

We show that the Riemannian gradient descent algorithm on the low-rank matrix manifold almost surely escapes some spurious critical points on the boundary of the manifold. Given that the low-rank matrix manifold is an incomplete set, this result is the first to overcome this difficulty and partially justify the global use of the Riemannian gradient descent on the manifold. The spurious critical points are some rank-deficient matrices that capture only part of the SVD components of the ground truth. They exhibit very singular behavior and evade the classical analysis of strict saddle points. We show that using the dynamical low-rank approximation and a rescaled gradient flow, some of the spurious critical points can be converted to classical strict saddle points, which leads to the desired result. Numerical experiments are provided to support our theoretical findings.

gradient flow, saddle point, spurious critical point, (15 more...)

arXiv.org Machine Learning

2107.09207

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.58)

Add feedback

Rethinking the limiting dynamics of SGD: modified loss, phase space oscillations, and anomalous diffusion

Kunin, Daniel, Sagastuy-Brena, Javier, Gillespie, Lauren, Margalit, Eshed, Tanaka, Hidenori, Ganguli, Surya, Yamins, Daniel L. K.

arXiv.org Machine LearningJul-19-2021

In this work we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). We find empirically that long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction between the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD.

equation, expression, trajectory, (15 more...)

arXiv.org Machine Learning

2107.09133

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (0.63)

Industry: Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

Add feedback

Structured Stochastic Gradient MCMC

Alexos, Antonios, Boyd, Alex, Mandt, Stephan

arXiv.org Machine LearningJul-19-2021

Stochastic gradient Markov chain Monte Carlo (SGMCMC) is considered the gold standard for Bayesian inference in large-scale models, such as Bayesian neural networks. Since practitioners face speed versus accuracy tradeoffs in these models, variational inference (VI) is often the preferable option. Unfortunately, VI makes strong assumptions on both the factorization and functional form of the posterior. In this work, we propose a new non-parametric variational approximation that makes no assumptions about the approximate posterior's functional form and allows practitioners to specify the exact dependencies the algorithm should respect or break. The approach relies on a new Langevin-type algorithm that operates on a modified energy function, where parts of the latent variables are averaged over samples from earlier iterations of the Markov chain. This way, statistical dependencies can be broken in a controlled way, allowing the chain to mix faster. This scheme can be further modified in a ''dropout'' manner, leading to even more scalability. By implementing the scheme on a ResNet-20 architecture, we obtain better predictive likelihoods and larger effective sample sizes than full SGMCMC.

algorithm, posterior distribution, variational inference, (13 more...)

arXiv.org Machine Learning

2107.09028

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > California > Orange County > Irvine (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.87)

Add feedback

Non-asymptotic estimates for TUSLA algorithm for non-convex learning with applications to neural networks with ReLU activation function

Lim, Dong-Young, Neufeld, Ariel, Sabanis, Sotirios, Zhang, Ying

arXiv.org Machine LearningJul-19-2021

We consider non-convex stochastic optimization problems where the objective functions have super-linearly growing and discontinuous stochastic gradients. In such a setting, we provide a non-asymptotic analysis for the tamed unadjusted stochastic Langevin algorithm (TUSLA) introduced in Lovas et al. (2021). In particular, we establish non-asymptotic error bounds for the TUSLA algorithm in Wasserstein-1 and Wasserstein-2 distances. The latter result enables us to further derive non-asymptotic estimates for the expected excess risk. To illustrate the applicability of the main results, we consider an example from transfer learning with ReLU neural networks, which represents a key paradigm in machine learning. Numerical experiments are presented for the aforementioned example which supports our theoretical findings. Hence, in this setting, we demonstrate both theoretically and numerically that the TUSLA algorithm can solve the optimization problem involving neural networks with ReLU activation function. Besides, we provide simulation results for synthetic examples where popular algorithms, e.g. ADAM, AMSGrad, RMSProp, and (vanilla) SGD, may fail to find the minimizer of the objective functions due to the super-linear growth and the discontinuity of the corresponding stochastic gradient, while the TUSLA algorithm converges rapidly to the optimal solution.

application, lemma 4, non-asymptotic estimate, (16 more...)

arXiv.org Machine Learning

2107.08649

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Add feedback

Gentle Introduction to Gradient Descent and Momentum

#artificialintelligenceJul-18-2021, 09:15:52 GMT

In this article, we will talk about a fundamental concept in machine learning called the Gradient Descent. The gradient descent is one of the most popular algorithms that tends to reduce the error in prediction i.e minimizing your cost function. This might have been confusing but that's okay, before we jump into more details I'll give a very small gist of where it is mostly used. In deep learning, we have a concept called backpropagation. Wikipedia says " backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input–output example, and does so efficiently, unlike a naïve direct computation of the gradient with respect to each weight individually" I had a brain-freeze when I read this, so let me give you an intuitive example to help you understand better.

gradient descent, gradient descent and momentum, minima, (3 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.93)

Add feedback

A Glance at Optimization algorithms for Deep Learning

#artificialintelligenceJul-17-2021, 00:30:09 GMT

Batch Gradient Descent, Mini-batch Gradient Descent and Stochastic Gradient Descent are techniques used for gradient optimization differ in the batch size they use for computing gradients in each iteration. Gradient Descent uses all the data to compute gradients and update weights in each iteration. Minibatch Gradient Descent takes a subset of dataset to update its weights in each iteration. It however takes more iterations to converge to minima, but it is faster as compared to Gradient Descent due to lesser size of batch data used. Stochastic Gradient Descent (SGD) (or also sometimes on-line gradient descent) is the extreme case of this.

gradient descent, iteration, optimization algorithm, (6 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback