Gradient Descent
Dual Parameterization of Sparse Variational Gaussian Processes
Adam, Vincent, Chang, Paul E., Khan, Mohammad Emtiyaz, Solin, Arno
Sparse variational Gaussian process (SVGP) methods are a common choice for non-conjugate Gaussian process inference because of their computational benefits. In this paper, we improve their computational efficiency by using a dual parameterization where each data example is assigned dual parameters, similarly to site parameters used in expectation propagation. Our dual parameterization speeds-up inference using natural gradient descent, and provides a tighter evidence lower bound for hyperparameter learning. The approach has the same memory cost as the current SVGP methods, but it is faster and more accurate.
Sharp Bounds for Federated Averaging (Local SGD) and Continuous Perspective
Glasgow, Margalit, Yuan, Honglin, Ma, Tengyu
Federated Averaging (FedAvg), also known as Local SGD, is one of the most popular algorithms in Federated Learning (FL). Despite its simplicity and popularity, the convergence rate of FedAvg has thus far been undetermined. Even under the simplest assumptions (convex, smooth, homogeneous, and bounded covariance), the best-known upper and lower bounds do not match, and it is not clear whether the existing analysis captures the capacity of the algorithm. In this work, we first resolve this question by providing a lower bound for FedAvg that matches the existing upper bound, which shows the existing FedAvg upper bound analysis is not improvable. Additionally, we establish a lower bound in a heterogeneous setting that nearly matches the existing upper bound. While our lower bounds show the limitations of FedAvg, under an additional assumption of third-order smoothness, we prove more optimistic state-of-the-art convergence results in both convex and non-convex settings. Our analysis stems from a notion we call iterate bias, which is defined by the deviation of the expectation of the SGD trajectory from the noiseless gradient descent trajectory with the same initialization. We prove novel sharp bounds on this quantity, and show intuitively how to analyze this quantity from a Stochastic Differential Equation (SDE) perspective.
Big-Step-Little-Step: Efficient Gradient Methods for Objectives with Multiple Scales
Kelner, Jonathan, Marsden, Annie, Sharan, Vatsal, Sidford, Aaron, Valiant, Gregory, Yuan, Honglin
We provide new gradient-based methods for efficiently solving a broad class of ill-conditioned optimization problems. We consider the problem of minimizing a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ which is implicitly decomposable as the sum of $m$ unknown non-interacting smooth, strongly convex functions and provide a method which solves this problem with a number of gradient evaluations that scales (up to logarithmic factors) as the product of the square-root of the condition numbers of the components. This complexity bound (which we prove is nearly optimal) can improve almost exponentially on that of accelerated gradient methods, which grow as the square root of the condition number of $f$. Additionally, we provide efficient methods for solving stochastic, quadratic variants of this multiscale optimization problem. Rather than learn the decomposition of $f$ (which would be prohibitively expensive), our methods apply a clean recursive "Big-Step-Little-Step" interleaving of standard methods. The resulting algorithms use $\tilde{\mathcal{O}}(d m)$ space, are numerically stable, and open the door to a more fine-grained understanding of the complexity of convex optimization beyond condition number.
Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks
Shevchenko, Alexander, Kungurtsev, Vyacheslav, Mondelli, Marco
Understanding the properties of neural networks trained via stochastic gradient descent (SGD) is at the heart of the theory of deep learning. In this work, we take a mean-field view, and consider a two-layer ReLU network trained via SGD for a univariate regularized regression problem. Our main result is that SGD is biased towards a simple solution: at convergence, the ReLU network implements a piecewise linear map of the inputs, and the number of "knot" points - i.e., points where the tangent of the ReLU network estimator changes - between two consecutive training inputs is at most three. In particular, as the number of neurons of the network grows, the SGD dynamics is captured by the solution of a gradient flow and, at convergence, the distribution of the weights approaches the unique minimizer of a related free energy, which has a Gibbs form. Our key technical contribution consists in the analysis of the estimator resulting from this minimizer: we show that its second derivative vanishes everywhere, except at some specific locations which represent the "knot" points. We also provide empirical evidence that knots at locations distinct from the data points might occur, as predicted by our theory.
Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization
Wang, Yujia, Lin, Lu, Chen, Jinghui
Due to the explosion in the size of the training datasets, distributed learning has received growing interest in recent years. One of the major bottlenecks is the large communication cost between the central server and the local workers. While error feedback compression has been proven to be successful in reducing communication costs with stochastic gradient descent (SGD), there are much fewer attempts in building communication-efficient adaptive gradient methods with provable guarantees, which are widely used in training large-scale machine learning models. In this paper, we propose a new communication-compressed AMSGrad for distributed nonconvex optimization problem, which is provably efficient. Our proposed distributed learning framework features an effective gradient compression strategy and a worker-side model update design. We prove that the proposed communication-efficient distributed adaptive gradient method converges to the first-order stationary point with the same iteration complexity as uncompressed vanilla AMSGrad in the stochastic nonconvex optimization setting. Experiments on various benchmarks back up our theory.
Fast Global Convergence of Policy Optimization for Constrained MDPs
Liu, Tao, Zhou, Ruida, Kalathil, Dileep, Kumar, P. R., Tian, Chao
We address the issue of safety in reinforcement learning. We pose the problem in a discounted infinite-horizon constrained Markov decision process framework. Existing results have shown that gradient-based methods are able to achieve an $\mathcal{O}(1/\sqrt{T})$ global convergence rate both for the optimality gap and the constraint violation. We exhibit a natural policy gradient-based algorithm that has a faster convergence rate $\mathcal{O}(\log(T)/T)$ for both the optimality gap and the constraint violation. When Slater's condition is satisfied and known a priori, zero constraint violation can be further guaranteed for a sufficiently large $T$ while maintaining the same convergence rate.
Limiting fluctuation and trajectorial stability of multilayer neural networks with mean field training
Pham, Huy Tuan, Nguyen, Phan-Minh
The mean field (MF) theory of multilayer neural networks centers around a particular infinite-width scaling, where the learning dynamics is closely tracked by the MF limit. A random fluctuation around this infinite-width limit is expected from a large-width expansion to the next order. This fluctuation has been studied only in shallow networks, where previous works employ heavily technical notions or additional formulation ideas amenable only to that case. Treatment of the multilayer case has been missing, with the chief difficulty in finding a formulation that captures the stochastic dependency across not only time but also depth. In this work, we initiate the study of the fluctuation in the case of multilayer networks, at any network depth. Leveraging on the neuronal embedding framework recently introduced by Nguyen and Pham, we systematically derive a system of dynamical equations, called the second-order MF limit, that captures the limiting fluctuation distribution. We demonstrate through the framework the complex interaction among neurons in this second-order MF limit, the stochasticity with cross-layer dependency and the nonlinear time evolution inherent in the limiting fluctuation. A limit theorem is proven to relate quantitatively this limit to the fluctuation of large-width networks. We apply the result to show a stability property of gradient descent MF training: in the large-width regime, along the training trajectory, it progressively biases towards a solution with "minimal fluctuation" (in fact, vanishing fluctuation) in the learned output function, even after the network has been initialized at or has converged (sufficiently fast) to a global optimum. This extends a similar phenomenon previously shown only for shallow networks with a squared loss in the ERM setting, to multilayer networks with a loss function that is not necessarily convex in a more general setting.
Vector Calculus for Machine Learning
To keep this post as engaging and entertaining as possible I will first introduce a brief history of Calculus and why I think it is so cool. Then, we will move on to reviewing fundamental concepts of your high school calculus such as derivative rules. Next, we will get our feet wet with vectors and matrices to make sure you are comfortable with these mathematical objects before covering partial and vector derivatives. Finally, I will conclude this post with the concept of a gradient, the intuition behind optimization with Gradient Descent and a cool implementation of calculus with Python leveraging the library SimPy. Feel free to skip any sections you like if you are comfortable with such topics. At the core, Calculus is just a very special way of thinking about large problems by splitting them into several, smaller, problems.
Gradient Descent with 'Math'
In the last blog we saw about basics of Gradient Descent and how it works.This time we will see math behind it. We are actually subtracting some part from value of parameter and updating it.We keep doing this until we get optimized value of parameter so the cost is minimum. You may be thinking that why '-' sign is used in above equation. If you look at image below, in the right side of curve slope is positive so by subtracting value from theta, we are actually getting closer to the optimal value, while on the left side the slope is negative so we are actually adding some part in value of theta and so getting closer to the optimal value. We keep updating value of theta until the change in value 0.001 (values may vary according to case).Usually we take value of learning rate as 0.01
An overview on gradient descent and its variants
The term "optimization" refers to the process of iteratively training a model to produce a maximum and minimum function evaluation to get a minimum cost function. It is crucial since it will assist us in obtaining a model with the least amount of error (as there will be discrepancies between the actual and predicted values). There are various optimization methods; in this article, we'll look at gradient descent and its three forms: batch, stochastic, and mini-batch. Note: Hyperparameter optimization is required to fine-tune the model. Before you begin training the model, you must first specify hyperparameters.