Gradient Descent
Normal Approximation for Stochastic Gradient Descent via Non-Asymptotic Rates of Martingale CLT
Anastasiou, Andreas, Balasubramanian, Krishnakumar, Erdogdu, Murat A.
We provide non-asymptotic convergence rates of the Polyak-Ruppert averaged stochastic gradient descent (SGD) to a normal random vector for a class of twice-differentiable test functions. A crucial intermediate step is proving a non-asymptotic martingale central limit theorem (CLT), i.e., establishing the rates of convergence of a multivariate martingale difference sequence to a normal random vector, which might be of independent interest. We obtain the explicit rates for the multivariate martingale CLT using a combination of Stein's method and Lindeberg's argument, which is then used in conjunction with a non-asymptotic analysis of averaged SGD proposed in [PJ92]. Our results have potentially interesting consequences for computing confidence intervals for parameter estimation with SGD and constructing hypothesis tests with SGD that are valid in a non-asymptotic sense.
A Stochastic Interpretation of Stochastic Mirror Descent: Risk-Sensitive Optimality
Stochastic mirror descent (SMD) is a fairly new family of algorithms that has recently found a wide range of applications in optimization, machine learning, and control. It can be considered a generalization of the classical stochastic gradient algorithm (SGD), where instead of updating the weight vector along the negative direction of the stochastic gradient, the update is performed in a "mirror domain" defined by the gradient of a (strictly convex) potential function. This potential function, and the mirror domain it yields, provides considerable flexibility in the algorithm compared to SGD. While many properties of SMD have already been obtained in the literature, in this paper we exhibit a new interpretation of SMD, namely that it is a risk-sensitive optimal estimator when the unknown weight vector and additive noise are non-Gaussian and belong to the exponential family of distributions. The analysis also suggests a modified version of SMD, which we refer to as symmetric SMD (SSMD). The proofs rely on some simple properties of Bregman divergence, which allow us to extend results from quadratics and Gaussians to certain convex functions and exponential families in a rather seamless way.
Convergence of the ADAM algorithm from a Dynamical System Viewpoint
Barakat, Anas, Bianchi, Pascal
Adam is a popular variant of the stochastic gradient descent for finding a local minimizer of a function. The objective function is unknown but a random estimate of the current gradient vector is observed at each round of the algorithm. This paper investigates the dynamical behavior of Adam when the objective function is non-convex and differentiable. We introduce a continuous-time version of Adam, under the form of a non-autonomous ordinary differential equation (ODE). The existence and the uniqueness of the solution are established, as well as the convergence of the solution towards the stationary points of the objective function. It is also proved that the continuous-time system is a relevant approximation of the Adam iterates, in the sense that the interpolated Adam process converges weakly to the solution to the ODE.
Interplay Between Optimization and Generalization of Stochastic Gradient Descent with Covariance Noise
Wen, Yeming, Luk, Kevin, Gazeau, Maxime, Zhang, Guodong, Chan, Harris, Ba, Jimmy
The choice of batch-size in a stochastic optimization algorithm plays a substantial role for both optimization and generalization. Increasing the batch-size used typically improves optimization but degrades generalization. To address the problem of improving generalization while maintaining optimal convergence in large-batch training, we propose to add covariance noise to the gradients. We demonstrate that the optimization performance of our method is more accurately captured by the structure of the noise covariance matrix rather than by the variance of gradients. Moreover, over the convex-quadratic, we prove in theory that it can be characterized by the Frobenius norm of the noise matrix. Our empirical studies with standard deep learning model-architectures and datasets shows that our method not only improves generalization performance in large-batch training, but furthermore, does so in a way where the optimization performance remains desirable and the training duration is not elongated.
Exponentially convergent stochastic k-PCA without variance reduction
We show, both theoretically and empirically, that the algorithm naturally adapts to data low-rankness and converges exponentially fast to the ground-truth principal subspace. Notably, our result suggests that despite various recent efforts to accelerate the convergence of stochastic-gradient based methods by adding a O(n)-time variance reduction step, for the k-PCA problem, a truly online SGD variant suffices to achieve exponential convergence on intrinsically low-rank data.
Convergence rates for the stochastic gradient descent method for non-convex objective functions
Fehrman, Benjamin, Gess, Benjamin, Jentzen, Arnulf
We prove the local convergence to minima and estimates on the rate of convergence for the stochastic gradient descent method in the case of not necessarily globally convex nor contracting objective functions. In particular, the results are applicable to simple objective functions arising in machine learning.
Benchmarking Approximate Inference Methods for Neural Structured Prediction
Exact structured inference with neural network scoring functions is computationally challenging but several methods have been proposed for approximating inference. One approach is to perform gradient descent with respect to the output structure directly (Belanger and McCallum, 2016). Another approach, proposed recently, is to train a neural network (an "inference network") to perform inference (Tu and Gimpel, 2018). In this paper, we compare these two families of inference methods on three sequence labeling datasets. We choose sequence labeling because it permits us to use exact inference as a benchmark in terms of speed, accuracy, and search error. Across datasets, we demonstrate that inference networks achieve a better speed/accuracy/search error trade-off than gradient descent, while also being faster than exact inference at similar accuracy levels. We find further benefit by combining inference networks and gradient descent, using the former to provide a warm start for the latter.
On the Power and Limitations of Random Features for Understanding Neural Networks
Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as if those components are essentially fixed at their initial random values. In fact, fixing these explicitly leads to the well-known approach of learning with random features. In other words, these techniques imply that we can successfully learn with neural networks, whenever we can successfully learn with random features. In this paper, we first review these techniques, providing a simple and self-contained analysis for one-hidden-layer networks. We then argue that despite the impressive positive results, random feature approaches are also inherently limited in what they can explain. In particular, we rigorously show that random features cannot be used to learn even a single ReLU neuron with standard Gaussian inputs, unless the network size (or magnitude of the weights) is exponentially large. Since a single neuron is learnable with gradient-based methods, we conclude that we are still far from a satisfying general explanation for the empirical success of neural networks.
Gradient Descent Intuition -- How Machines Learn
Gradient descent is an algorithm to find the minimum of a function. In machine learning, the difference between the predicted and the actual value is called error. All error, across all datapoints, when added together is known as the cost. Naturally, we would want to minimize the function that represents this cost -- Cost Function. When the cost function cannot be differentiated easily Gradient Descent does the job for us.
On the Stability and Generalization of Learning with Kernel Activation Functions
Cirillo, Michele, Scardapane, Simone, Van Vaerenbergh, Steven, Uncini, Aurelio
In this brief we investigate the generalization properties of a recently-proposed class of non-parametric activation functions, the kernel activation functions (KAFs). KAFs introduce additional parameters in the learning process in order to adapt nonlinearities individually on a per-neuron basis, exploiting a cheap kernel expansion of every activation value. While this increase in flexibility has been shown to provide significant improvements in practice, a theoretical proof for its generalization capability has not been addressed yet in the literature. Here, we leverage recent literature on the stability properties of non-convex models trained via stochastic gradient descent (SGD). By indirectly proving two key smoothness properties of the models under consideration, we prove that neural networks endowed with KAFs generalize well when trained with SGD for a finite number of steps. Interestingly, our analysis provides a guideline for selecting one of the hyper-parameters of the model, the bandwidth of the scalar Gaussian kernel. A short experimental evaluation validates the proof.