Gradient Descent
An Empirical Analysis of Measure-Valued Derivatives for Policy Gradients
Carvalho, João, Tateo, Davide, Muratore, Fabio, Peters, Jan
Reinforcement learning methods for robotics are increasingly successful due to the constant development of better policy gradient techniques. A precise (low variance) and accurate (low bias) gradient estimator is crucial to face increasingly complex tasks. Traditional policy gradient algorithms use the likelihood-ratio trick, which is known to produce unbiased but high variance estimates. More modern approaches exploit the reparametrization trick, which gives lower variance gradient estimates but requires differentiable value function approximators. In this work, we study a different type of stochastic gradient estimator: the Measure-Valued Derivative. This estimator is unbiased, has low variance, and can be used with differentiable and non-differentiable function approximators. We empirically evaluate this estimator in the actor-critic policy gradient setting and show that it can reach comparable performance with methods based on the likelihood-ratio or reparametrization tricks, both in low and high-dimensional action spaces.
Kernel Selection for Stein Variational Gradient Descent
Ai, Qingzhong, Liu, Shiyu, Xu, Zenglin
Stein variational gradient descent (SVGD) and its variants have shown promising successes in approximate inference for complex distributions. However, their empirical performance depends crucially on the choice of optimal kernel. Unfortunately, RBF kernel with median heuristics is a common choice in previous approaches which has been proved sub-optimal. Inspired by the paradigm of multiple kernel learning, our solution to this issue is using a combination of multiple kernels to approximate the optimal kernel instead of a single one which may limit the performance and flexibility. To do so, we extend Kernelized Stein Discrepancy (KSD) to its multiple kernel view called Multiple Kernelized Stein Discrepancy (MKSD). Further, we leverage MKSD to construct a general algorithm based on SVGD, which be called Multiple Kernel SVGD (MK-SVGD). Besides, we automatically assign a weight to each kernel without any other parameters. The proposed method not only gets rid of optimal kernel dependence but also maintains computational effectiveness. Experiments on various tasks and models show the effectiveness of our method.
Proxy Convexity: A Unified Framework for the Analysis of Neural Networks Trained by Gradient Descent
Understanding the ability of gradient-based stochastic optimization algorithms to find good minima of non-convex objective functions has become an especially important problem due to the success of stochastic gradient descent (SGD) in learning deep neural networks. Although there exist non-convex objective functions and domains for which SGD will necessarily lead to sub-optimal local minima, it appears that for many problems of interest in deep learning, across domains as varied as natural language and images, these worst-case situations do not arise. Indeed, a number of recent works have developed provable guarantees for GD and SGD when used for objective functions defined in terms of neural networks over certain distributions, despite the non-convexity of the underlying optimization problem (Brutzkus et al., 2018; Allen-Zhu et al., 2019; Cao and Gu, 2020; Ji and Telgarsky, 2020; Frei et al., 2020, 2021b). To date, however, there has not been a framework which could unify the variegated approaches for guarantees in these settings. In this work, we introduce the notion of proxy convexity and demonstrate that many existing provable guarantees for learning with neural networks trained by gradient-based optimization fall into a problem satisfying proxy convexity.
Edge of chaos as a guiding principle for modern neural network training
Zhang, Lin, Feng, Ling, Chen, Kan, Lai, Choy Heng
The success of deep neural networks in real-world problems has prompted many attempts to explain their training dynamics and generalization performance, but more guiding principles for the training of neural networks are still needed. Motivated by the edge of chaos principle behind the optimal performance of neural networks, we study the role of various hyperparameters in modern neural network training algorithms in terms of the order-chaos phase diagram. In particular, we study a fully analytical feedforward neural network trained on the widely adopted Fashion-MNIST dataset, and study the dynamics associated with the hyperparameters in back-propagation during the training process. We find that for the basic algorithm of stochastic gradient descent with momentum, in the range around the commonly used hyperparameter values, clear scaling relations are present with respect to the training time during the ordered phase in the phase diagram, and the model's optimal generalization power at the edge of chaos is similar across different training parameter combinations. In the chaotic phase, the same scaling no longer exists. The scaling allows us to choose the training parameters to achieve faster training without sacrificing performance. In addition, we find that the commonly used model regularization method - weight decay - effectively pushes the model towards the ordered phase to achieve better performance. Leveraging on this fact and the scaling relations in the other hyperparameters, we derived a principled guideline for hyperparameter determination, such that the model can achieve optimal performance by saturating it at the edge of chaos. Demonstrated on this simple neural network model and training algorithm, our work improves the understanding of neural network training dynamics, and can potentially be extended to guiding principles of more complex model architectures and algorithms.
Asymptotic Escape of Spurious Critical Points on the Low-rank Matrix Manifold
Hou, Thomas Y., Li, Zhenzhen, Zhang, Ziyun
We show that the Riemannian gradient descent algorithm on the low-rank matrix manifold almost surely escapes some spurious critical points on the boundary of the manifold. Given that the low-rank matrix manifold is an incomplete set, this result is the first to overcome this difficulty and partially justify the global use of the Riemannian gradient descent on the manifold. The spurious critical points are some rank-deficient matrices that capture only part of the SVD components of the ground truth. They exhibit very singular behavior and evade the classical analysis of strict saddle points. We show that using the dynamical low-rank approximation and a rescaled gradient flow, some of the spurious critical points can be converted to classical strict saddle points, which leads to the desired result. Numerical experiments are provided to support our theoretical findings.
Rethinking the limiting dynamics of SGD: modified loss, phase space oscillations, and anomalous diffusion
Kunin, Daniel, Sagastuy-Brena, Javier, Gillespie, Lauren, Margalit, Eshed, Tanaka, Hidenori, Ganguli, Surya, Yamins, Daniel L. K.
In this work we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). We find empirically that long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction between the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD.
Structured Stochastic Gradient MCMC
Alexos, Antonios, Boyd, Alex, Mandt, Stephan
Stochastic gradient Markov chain Monte Carlo (SGMCMC) is considered the gold standard for Bayesian inference in large-scale models, such as Bayesian neural networks. Since practitioners face speed versus accuracy tradeoffs in these models, variational inference (VI) is often the preferable option. Unfortunately, VI makes strong assumptions on both the factorization and functional form of the posterior. In this work, we propose a new non-parametric variational approximation that makes no assumptions about the approximate posterior's functional form and allows practitioners to specify the exact dependencies the algorithm should respect or break. The approach relies on a new Langevin-type algorithm that operates on a modified energy function, where parts of the latent variables are averaged over samples from earlier iterations of the Markov chain. This way, statistical dependencies can be broken in a controlled way, allowing the chain to mix faster. This scheme can be further modified in a ''dropout'' manner, leading to even more scalability. By implementing the scheme on a ResNet-20 architecture, we obtain better predictive likelihoods and larger effective sample sizes than full SGMCMC.
Non-asymptotic estimates for TUSLA algorithm for non-convex learning with applications to neural networks with ReLU activation function
Lim, Dong-Young, Neufeld, Ariel, Sabanis, Sotirios, Zhang, Ying
We consider non-convex stochastic optimization problems where the objective functions have super-linearly growing and discontinuous stochastic gradients. In such a setting, we provide a non-asymptotic analysis for the tamed unadjusted stochastic Langevin algorithm (TUSLA) introduced in Lovas et al. (2021). In particular, we establish non-asymptotic error bounds for the TUSLA algorithm in Wasserstein-1 and Wasserstein-2 distances. The latter result enables us to further derive non-asymptotic estimates for the expected excess risk. To illustrate the applicability of the main results, we consider an example from transfer learning with ReLU neural networks, which represents a key paradigm in machine learning. Numerical experiments are presented for the aforementioned example which supports our theoretical findings. Hence, in this setting, we demonstrate both theoretically and numerically that the TUSLA algorithm can solve the optimization problem involving neural networks with ReLU activation function. Besides, we provide simulation results for synthetic examples where popular algorithms, e.g. ADAM, AMSGrad, RMSProp, and (vanilla) SGD, may fail to find the minimizer of the objective functions due to the super-linear growth and the discontinuity of the corresponding stochastic gradient, while the TUSLA algorithm converges rapidly to the optimal solution.
Gentle Introduction to Gradient Descent and Momentum
In this article, we will talk about a fundamental concept in machine learning called the Gradient Descent. The gradient descent is one of the most popular algorithms that tends to reduce the error in prediction i.e minimizing your cost function. This might have been confusing but that's okay, before we jump into more details I'll give a very small gist of where it is mostly used. In deep learning, we have a concept called backpropagation. Wikipedia says " backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input–output example, and does so efficiently, unlike a naïve direct computation of the gradient with respect to each weight individually" I had a brain-freeze when I read this, so let me give you an intuitive example to help you understand better.
A Glance at Optimization algorithms for Deep Learning
Batch Gradient Descent, Mini-batch Gradient Descent and Stochastic Gradient Descent are techniques used for gradient optimization differ in the batch size they use for computing gradients in each iteration. Gradient Descent uses all the data to compute gradients and update weights in each iteration. Minibatch Gradient Descent takes a subset of dataset to update its weights in each iteration. It however takes more iterations to converge to minima, but it is faster as compared to Gradient Descent due to lesser size of batch data used. Stochastic Gradient Descent (SGD) (or also sometimes on-line gradient descent) is the extreme case of this.