Gradient Descent
Manifold Optimisation Assisted Gaussian Variational Approximation
Zhou, Bingxin, Gao, Junbin, Tran, Minh-Ngoc, Gerlach, Richard
Variational approximation methods are a way to approximate the posterior in Bayesian inference, especially when the dataset is large or high-dimensional. Factor covariance structure was introduced in previous work with three restrictions to handle the problem of computational infeasibility in Gaussian approximation. However, the three strong constraints on the covariance matrix could break down during the process of the structure optimization, and the identification issue could still exist within the final approximation. In this paper, we consider two types of manifold parameterization, Stiefel manifold and Grassmann manifold, to address the problems. Moreover, the Riemannian stochastic gradient descent method is applied to solve the resulting optimization problem while maintaining the orthogonal factors. Results from two experiments demonstrate that our model fixes the potential issue of the previous method with comparable accuracy and competitive convergence speed even in high-dimensional problems.
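As an illustration of how a Riemannian stochastic gradient step can keep a factor matrix orthonormal, here is a minimal sketch on the Stiefel manifold. It is not the authors' algorithm: the QR retraction, the toy objective f(X) = -0.5 tr(X^T A X), the noise level, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def stiefel_project_tangent(X, G):
    """Project a Euclidean gradient G onto the tangent space of the Stiefel manifold at X."""
    XtG = X.T @ G
    sym = 0.5 * (XtG + XtG.T)
    return G - X @ sym

def qr_retraction(Y):
    """Map a full-rank matrix back onto the manifold via the Q factor of a QR decomposition."""
    Q, R = np.linalg.qr(Y)
    # Flip column signs so that diag(R) is positive, which makes the factorization unique.
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

def riemannian_sgd_step(X, euclid_grad, lr):
    """One step: project the gradient, move along it, retract onto the manifold."""
    xi = stiefel_project_tangent(X, euclid_grad)
    return qr_retraction(X - lr * xi)

# Toy problem (assumed for illustration): maximize tr(X^T A X) with noisy gradients.
rng = np.random.default_rng(0)
p, k = 6, 2
W = rng.standard_normal((p, p))
A = W @ W.T                                   # symmetric positive semi-definite
X = qr_retraction(rng.standard_normal((p, k)))
for _ in range(200):
    noise = 0.01 * rng.standard_normal((p, k))
    X = riemannian_sgd_step(X, -A @ X + noise, lr=0.01)
```

After every step the columns of `X` remain orthonormal up to floating-point error, which is the property the abstract refers to as maintaining the orthogonal factors.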
A stochastic version of Stein Variational Gradient Descent for efficient sampling
Li, Lei, Liu, Jian-Guo, Liu, Zibu, Lu, Jianfeng
The empirical measure with samples from some probability measure (which might be known up to a multiplicative factor) has many applications in Bayesian inference [1, 2] and data assimilation [3]. A class of widely used sampling methods is the Markov Chain Monte Carlo (MCMC) methods, where the trajectory of a particle is given by some constructed Markov chain with the desired distribution invariant. The trajectory of the particle is clearly stochastic, and the Monte Carlo methods take effect slowly for a small number of samples. Unlike MCMC, the Stein Variational Gradient Descent (SVGD) method (proposed by Liu and Wang in [4]) belongs to particle-based variational inference sampling methods (see also [5, 6]). These methods update particles by solving optimization problems, and each iteration is expected to make progress. As a nonparametric variational inference method, SVGD gives a deterministic way to generate points that approximate the desired probability distribution by solving an ODE system.
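The deterministic particle update described above can be sketched as follows. This is the plain (non-stochastic) SVGD step with an RBF kernel, written as a forward-Euler discretization of the particle ODE; the fixed bandwidth, the step size, the Gaussian target with mean 2, and the function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def svgd_step(particles, grad_log_p, step=0.1, bandwidth=1.0):
    """One SVGD update for particles of shape (n, d) using an RBF kernel."""
    n = particles.shape[0]
    diff = particles[:, None, :] - particles[None, :, :]   # diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))      # kernel[i, j] = k(x_i, x_j)
    # Driving term: sum_j k(x_j, x_i) * grad log p(x_j), pulls particles to high density.
    drive = kernel @ grad_log_p(particles)
    # Repulsive term: sum_j grad_{x_j} k(x_j, x_i), keeps particles spread out.
    repulse = np.sum(kernel[:, :, None] * diff, axis=1) / bandwidth ** 2
    return particles + step * (drive + repulse) / n

def grad_log_p(x):
    """Score of an assumed target N(2, 1); only needed up to the normalizing constant."""
    return -(x - 2.0)

rng = np.random.default_rng(0)
particles = rng.normal(loc=-3.0, scale=0.5, size=(50, 1))
for _ in range(500):
    particles = svgd_step(particles, grad_log_p)
```

Note that only the gradient of the log density is used, so the target may be known only up to a multiplicative factor, as stated in the abstract.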
On the convergence rate of stochastic proximal point algorithm without strong convexity, smoothness or bounded gradients
Significant parts of the recent learning literature on stochastic optimization algorithms have focused on the theoretical and practical behaviour of stochastic first-order schemes under different convexity properties. Due to its simplicity, the traditional method of choice for most supervised machine learning problems is the stochastic gradient descent (SGD) method. Many iteration improvements and accelerations have been added to pure SGD in order to boost its convergence in various (strong) convexity settings. However, Lipschitz gradient continuity or bounded gradient assumptions are an essential requirement for most existing stochastic first-order schemes. In this paper, novel convergence results are presented for the stochastic proximal point algorithm in different settings. In particular, without any strong convexity, smoothness or bounded gradients assumptions, we show that a slightly modified quadratic growth assumption is sufficient to guarantee an $\mathcal{O}\left(\frac{1}{k}\right)$ convergence rate for the stochastic proximal point algorithm, in terms of the distance to the optimal set. Furthermore, linear convergence is obtained in the interpolation setting, when the optimal set of the expected cost is included in the optimal sets of each functional component.
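To make the stochastic proximal point iteration concrete, the sketch below applies it to least-squares components f_i(x) = 0.5 (a_i . x - b_i)^2, for which the proximal step has a closed form. The problem data, the constant step size mu, and the function names are assumptions for illustration; the system is built to be consistent, so it falls in the interpolation setting mentioned in the abstract.

```python
import random

def prox_least_squares(x, a, b, mu):
    """Closed-form prox of mu * 0.5 * (a . z - b)^2 evaluated at x."""
    residual = sum(ai * xi for ai, xi in zip(a, x)) - b
    scale = mu * residual / (1.0 + mu * sum(ai * ai for ai in a))
    return [xi - scale * ai for xi, ai in zip(x, a)]

def stochastic_proximal_point(A, b, mu=1.0, iters=2000, seed=0):
    """Sample one component per iteration and apply its proximal operator."""
    rng = random.Random(seed)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        i = rng.randrange(len(A))
        x = prox_least_squares(x, A[i], b[i], mu)
    return x

# Consistent system: every component is minimized at x_true (interpolation).
x_true = [1.0, -2.0]
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
b = [sum(ai * xi for ai, xi in zip(row, x_true)) for row in A]
x_hat = stochastic_proximal_point(A, b)
```

Unlike an SGD step, the proximal step is implicit: the residual is damped by 1 / (1 + mu * ||a||^2), so the update stays stable for any positive mu.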
Combining learning rate decay and weight decay with complexity gradient descent - Part I
Richemond, Pierre H., Guo, Yike
The role of $L^2$ regularization, in the specific case of deep neural networks rather than more traditional machine learning models, is still not fully elucidated. We hypothesize that this complex interplay is due to the combination of overparameterization and high dimensional phenomena that take place during training and make it unamenable to standard convex optimization methods. Using insights from statistical physics and random fields theory, we introduce a parameter factoring in both the level of the loss function and its remaining nonconvexity: the \emph{complexity}. We show that it is desirable to perform \emph{complexity gradient descent}. We then show how to use this intuition to derive novel and efficient annealing schemes for the strength of $L^2$ regularization when performing standard stochastic gradient descent in deep neural networks.
Mean Field Limit of the Learning Dynamics of Multilayer Neural Networks
Can multilayer neural networks -- typically constructed as highly complex structures with many nonlinearly activated neurons across layers -- behave in a non-trivial way that yet simplifies away a major part of their complexities? In this work, we uncover a phenomenon in which the behavior of these complex networks -- under suitable scalings and stochastic gradient descent dynamics -- becomes independent of the number of neurons as this number grows sufficiently large. We develop a formalism in which this many-neurons limiting behavior is captured by a set of equations, thereby exposing a previously unknown operating regime of these networks. While the current pursuit is mathematically non-rigorous, it is complemented with several experiments that validate the existence of this behavior.
Compatible Natural Gradient Policy Search
Pajarinen, Joni, Thai, Hong Linh, Akrour, Riad, Peters, Jan, Neumann, Gerhard
Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the region of trust resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule leading to premature convergence. To control entropy reduction we introduce a new policy search method called compatible policy search (COPOS) which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning
Reisizadeh, Amirhossein, Prakash, Saurav, Pedarsani, Ramtin, Avestimehr, Amir Salman
We focus on the commonly used synchronous Gradient Descent paradigm for large-scale distributed learning, for which there has been a growing interest to develop efficient and robust gradient aggregation strategies that overcome two key bottlenecks: communication bandwidth and stragglers' delays. In particular, the Ring-AllReduce (RAR) design has been proposed to avoid a bandwidth bottleneck at any particular node by allowing each worker to only communicate with its neighbors that are arranged in a logical ring. On the other hand, Gradient Coding (GC) has been recently proposed to mitigate stragglers in a master-worker topology by allowing carefully designed redundant allocation of the data set to the workers. We propose a joint communication topology design and data set allocation strategy, named CodedReduce (CR), that combines the best of both RAR and GC. That is, it parallelizes the communications over a tree topology leading to efficient bandwidth utilization, and carefully designs a redundant data set allocation and coding strategy at the nodes to make the proposed gradient aggregation scheme robust to stragglers. In particular, we quantify the communication parallelization gain and resiliency of the proposed CR scheme, and prove its optimality when the communication topology is a regular tree. Furthermore, we empirically evaluate the performance of our proposed CR design over Amazon EC2 and demonstrate that it achieves speedups of up to 18.9x and 7.9x over the benchmarks GC and RAR, respectively.
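For readers unfamiliar with the Ring-AllReduce baseline, the following is a single-process simulation of its two phases (reduce-scatter, then all-gather), in which each worker only ever sends to its right-hand neighbor. It illustrates RAR only, not the proposed CodedReduce scheme; the chunk layout and the function name are assumptions for illustration.

```python
def ring_allreduce(worker_chunks):
    """Simulate Ring-AllReduce: worker_chunks[i][c] is chunk c of worker i's gradient.

    Assumes each of the n workers has split its gradient into exactly n chunks.
    Returns the per-worker data after aggregation (every worker holds the full sum).
    """
    n = len(worker_chunks)
    data = [[list(chunk) for chunk in worker] for worker in worker_chunks]

    # Phase 1 (reduce-scatter): after n - 1 steps, worker i holds the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Build all messages first so that the sends within a step are simultaneous.
        messages = [((i + 1) % n, (i - step) % n, data[i][(i - step) % n])
                    for i in range(n)]
        for dst, c, payload in messages:
            data[dst][c] = [x + y for x, y in zip(data[dst][c], payload)]

    # Phase 2 (all-gather): circulate the completed chunks around the ring.
    for step in range(n - 1):
        messages = [((i + 1) % n, (i + 1 - step) % n, data[i][(i + 1 - step) % n])
                    for i in range(n)]
        for dst, c, payload in messages:
            data[dst][c] = list(payload)

    return data

# Three workers, each holding a gradient split into three one-element chunks.
gradients = [
    [[1.0], [2.0], [3.0]],
    [[10.0], [20.0], [30.0]],
    [[100.0], [200.0], [300.0]],
]
reduced = ring_allreduce(gradients)
```

Each worker sends one chunk per step for 2(n - 1) steps, so the traffic per worker is roughly independent of the number of workers, which is the bandwidth property the abstract attributes to RAR.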
Exponentiated Gradient Meets Gradient Descent
Ghai, Udaya, Hazan, Elad, Singer, Yoram
The (stochastic) gradient descent and the multiplicative update method are probably the most popular algorithms in machine learning. We introduce and study a new regularization which provides a unification of the additive and multiplicative updates. This regularization is derived from a hyperbolic analogue of the entropy function, which we call hypentropy. It is motivated by a natural extension of the multiplicative update to negative numbers. The hypentropy has a natural spectral counterpart which we use to derive a family of matrix-based updates that bridge gradient methods and the multiplicative method for matrices. While the latter is only applicable to positive semi-definite matrices, the spectral hypentropy method can naturally be used with general rectangular matrices. We analyze the new family of updates by deriving tight regret bounds. We study empirically the applicability of the new update for settings such as multiclass learning, in which the parameters constitute a general rectangular matrix.
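The interpolation between additive and multiplicative updates can be illustrated numerically. The sketch below assumes the hypentropy regularizer has the form phi_beta(x) = sum_i (x_i arcsinh(x_i / beta) - sqrt(x_i^2 + beta^2)), whose gradient is arcsinh(x / beta); this form is not stated in the abstract, so both it and the resulting mirror-descent update are assumptions. To first order in lr * g, the step size of this update is lr * sqrt(beta^2 + w^2), which behaves additively for large beta and multiplicatively for small beta.

```python
import math

def gd_update(w, g, lr):
    """Additive update: plain gradient descent."""
    return [wi - lr * gi for wi, gi in zip(w, g)]

def multiplicative_update(w, g, lr):
    """Multiplicative (unnormalized exponentiated gradient) update for positive weights."""
    return [wi * math.exp(-lr * gi) for wi, gi in zip(w, g)]

def hypentropy_update(w, g, lr, beta):
    """Mirror-descent step under the assumed hypentropy regularizer (see lead-in)."""
    return [beta * math.sinh(math.asinh(wi / beta) - lr * gi)
            for wi, gi in zip(w, g)]

w = [0.5, 2.0]
g = [1.0, -0.5]
lr = 0.1

# Small beta: the update approaches the multiplicative rule for positive weights.
small_beta = hypentropy_update(w, g, lr, beta=1e-8)
# Large beta with learning rate lr / beta: the update approaches gradient descent.
large_beta = hypentropy_update(w, g, lr / 1e4, beta=1e4)
```

Unlike the multiplicative rule, the sinh-based update is well defined for negative weights, which matches the motivation given in the abstract.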
Distribution-Dependent Analysis of Gibbs-ERM Principle
Kuzborskij, Ilja, Cesa-Bianchi, Nicolò, Szepesvári, Csaba
Gibbs-ERM learning is a natural idealized model of learning with stochastic optimization algorithms (such as Stochastic Gradient Langevin Dynamics and, to some extent, Stochastic Gradient Descent), while it also arises in other contexts, including PAC-Bayesian theory and sampling mechanisms. In this work we study the excess risk suffered by a Gibbs-ERM learner that uses non-convex, regularized empirical risk with the goal of understanding the interplay between the data-generating distribution and learning in large hypothesis spaces. Our main results are distribution-dependent upper bounds on several notions of excess risk. We show that, in all cases, the distribution-dependent excess risk is essentially controlled by the effective dimension $\mathrm{tr}\left(\boldsymbol{H}^{\star} (\boldsymbol{H}^{\star} + \lambda \boldsymbol{I})^{-1}\right)$ of the problem, where $\boldsymbol{H}^{\star}$ is the Hessian matrix of the risk at a local minimum. This is a well-established notion of effective dimension appearing in several previous works, including the analyses of SGD and ridge regression, but ours is the first work that brings this dimension to the analysis of learning using Gibbs densities. The distribution-dependent view we advocate here improves upon earlier results of Raginsky et al. (2017), and can yield much tighter bounds depending on the interplay between the data-generating distribution and the loss function. The first part of our analysis focuses on the localized excess risk in the vicinity of a fixed local minimizer. This result is then extended to bounds on the global excess risk, by characterizing probabilities of local minima (and their complement) under Gibbs densities, a result which might be of independent interest.
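The effective dimension in the abstract can be computed directly from a Hessian and a regularization strength. The sketch below evaluates tr(H (H + lambda I)^{-1}) for a hypothetical diagonal Hessian; the matrix, the value of lambda, and the function name are assumptions for illustration.

```python
import numpy as np

def effective_dimension(H, lam):
    """Return tr(H (H + lam I)^{-1}) for a symmetric positive semi-definite H."""
    d = H.shape[0]
    # The trace is invariant under cyclic permutation, so solve (H + lam I) Z = H
    # and take tr(Z) instead of forming the inverse explicitly.
    Z = np.linalg.solve(H + lam * np.eye(d), H)
    return float(np.trace(Z))

# Hypothetical Hessian with one stiff, one moderate, and one nearly flat direction.
H = np.diag([10.0, 1.0, 0.01])
lam = 1.0
d_eff = effective_dimension(H, lam)

# Equivalent eigenvalue form: sum_i h_i / (h_i + lam).
eigs = np.linalg.eigvalsh(H)
d_eff_from_eigs = float(np.sum(eigs / (eigs + lam)))
```

Directions with curvature well above lambda each contribute close to 1, while nearly flat directions contribute close to 0, so the quantity can be much smaller than the ambient dimension.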
Total stochastic gradient algorithms and applications in reinforcement learning
Backpropagation and the chain rule of derivatives have been prominent; however, the total derivative rule has not enjoyed the same amount of attention. In this work we show how the total derivative rule leads to an intuitive visual framework for creating gradient estimators on graphical models. In particular, previous "policy gradient theorems" are easily derived. We derive new gradient estimators based on density estimation, as well as a likelihood ratio gradient, which "jumps" to an intermediate node, not directly to the objective function. We evaluate our methods on model-based policy gradient algorithms, achieve good performance, and present evidence towards demystifying the success of the popular PILCO algorithm.