Gradient Descent
Partial differential equation regularization for supervised machine learning
This article is an overview of supervised machine learning problems for regression and classification. Topics include: kernel methods, training by stochastic gradient descent, deep learning architecture, losses for classification, statistical learning theory, and dimension independent generalization bounds. Implicit regularization in deep learning examples are presented, including data augmentation, adversarial training, and additive noise. These methods are re-framed as explicit gradient regularization.
Reconsidering Analytical Variational Bounds for Output Layers of Deep Networks
Sakhi, Otmane, Bonner, Stephen, Rohde, David, Vasile, Flavian
The combination of the re-parameterization trick with the use of variational auto-encoders has caused a sensation in Bayesian deep learning, allowing the training of realistic generative models of images and has considerably increased our ability to use scalable latent variable models. The re-parameterization trick is necessary for models in which no analytical variational bound is available and allows noisy gradients to be computed for arbitrary models. However, for certain standard output layers of a neural network, analytical bounds are available and the variational auto-encoder may be used both without the re-parameterization trick or the need for any Monte Carlo approximation. In this work, we show that using Jaakola and Jordan bound, we can produce a binary classification layer that allows a Bayesian output layer to be trained, using the standard stochastic gradient descent algorithm. We further demonstrate that a latent variable model utilizing the Bouchard bound for multi-class classification allows for fast training of a fully probabilistic latent factor model, even when the number of classes is very large.
Escaping Saddle Points for Zeroth-order Nonconvex Optimization using Estimated Gradient Descent
Bai, Qinbo, Agarwal, Mridul, Aggarwal, Vaneet
Gradient descent and its variants are widely used in machine learning. However, oracle access of gradient may not be available in many applications, limiting the dire ct use of gradient descent. This paper proposes a method of estimating gradient to perform gradient descent, that converges to a stationary point for general non-convex optimization problems. Beyond the first-order stati onary properties, the second-order stationary properties are important in machine learning applications to achieve b etter performance. Gradient descent and its variants (e.g., Stochastic Gradie nt Descent) are widely used in machine learning due to their favorable computational properties, for examp le, in optimizing weights of a deep neural network. Recently, second order stationary guarant ees have been studied by using a perturbed version of gradient de scent [2].
Could a strange new memory chip unlock mysteries of AI? ZDNet
Modern artificial intelligence lacks a strong theoretical basis, and so it's often a shrug of the shoulders why it works at all (or, oftentimes, doesn't entirely work). One of the deepest mysteries of deep learning is one of its most brilliant successes, what's known as stochastic gradient descent. Stochasticity, the process of randomly picking out examples of data, has yielded breakthroughs in image recognition and other deep learning tasks. And now, one computer chip company thinks they may have a kind of machine for stochasticity, a chip whose power comes from randomness. It might not lead to a theory of why machine learning works, but it might lead to knew breakthroughs in what stochasticity can achieve.
Min-Max Optimization without Gradients: Convergence and Applications to Adversarial ML
Liu, Sijia, Lu, Songtao, Chen, Xiangyi, Feng, Yao, Xu, Kaidi, Al-Dujaili, Abdullah, Hong, Minyi, Obelilly, Una-May
In this paper, we study the problem of constrained robust (min-max) optimization ina black-box setting, where the desired optimizer cannot access the gradients of the objective function but may query its values. We present a principled optimization framework, integrating a zeroth-order (ZO) gradient estimator with an alternating projected stochastic gradient descent-ascent method, where the former only requires a small number of function queries and the later needs just one-step descent/ascent update. We show that the proposed framework, referred to as ZO-Min-Max, has a sub-linear convergence rate under mild conditions and scales gracefully with problem size. From an application side, we explore a promising connection between black-box min-max optimization and black-box evasion and poisoning attacks in adversarial machine learning (ML). Our empirical evaluations on these use cases demonstrate the effectiveness of our approach and its scalability to dimensions that prohibit using recent black-box solvers.
Distributed SGD Generalizes Well Under Asynchrony
Regatti, Jayanth, Tendolkar, Gaurav, Zhou, Yi, Gupta, Abhishek, Liang, Yingbin
Jayanth Regatti Gaurav Tendolkar Yi Zhou Abhishek Gupta Yingbin Liang Abstract -- The performance of fully synchronized distributed systems has faced a bottleneck due to the big data trend, under which asynchronous distributed systems are becoming a major popularity due to their powerful scalability. In this paper, we study the generalization performance of stochastic gradient descent (SGD) on a distributed asynchronous system. The system consists of multiple worker machines that compute stochastic gradients which are further sent to and aggregated on a common parameter server to update the variables, and the communication in the system suffers from possible delays. Under the algorithm stability framework, we prove that distributed asynchronous SGD generalizes well given enough data samples in the training optimization. In particular, our results suggest to reduce the learning rate as we allow more asynchrony in the distributed system. Such adaptive learning rate strategy improves the stability of the distributed algorithm and reduces the corresponding generalization error . Then, we confirm our theoretical findings via numerical experiments. I NTRODUCTION Stochastic gradient descent (SGD) and its variants (e.g., Adagrad, Adam, etc) have been very effective in solving many challenging machine learning problems such as training deep neural networks. In practice, the solution found by SGD via solving an empirical risk minimization problem typically has good generalization performance on the test dataset.
Gradient Descent: The Ultimate Optimizer
Chandra, Kartik, Meijer, Erik, Andow, Samantha, Arroyo-Fang, Emilio, Dea, Irene, George, Johann, Grueter, Melissa, Hosmer, Basil, Stumpos, Steffi, Tempest, Alanna, Yang, Shannon
Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as the learning rate. There exist many techniques for automated hyperparameter optimization, but they typically introduce even more hyperparameters to control the hyperparameter optimization process. We propose to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum. As these towers of gradient-based optimizers grow, they become significantly less sensitive to the choice of top-level hyperparameters, hence decreasing the burden on the user to search for optimal values.
On a convergence property of a geometrical algorithm for statistical manifolds
Akaho, Shotaro, Hino, Hideitsu, Murata, Noboru
Information geometry is a framework to analyze statistical inference and machine learning[2]. Geometrically, statistical inference and many machine learning algorithms can be regarded as procedures to find a projection to a model subspace from a given data point. In this paper, we focus on an algorithm to find the projection. Since the projection is given by minimizing a divergence, a common approach to finding the projection is a gradient-based method[6]. However, such an approach is not applicable in some cases. For instance, several attempts to extend the information geometrical framework to nonparametric cases[3, 9, 13, 15], where we need to consider a function space or each data is represented as a point process. In such a case, it is difficult to compute the derivative of divergence that is necessary for gradient-based methods, and in some cases, it is difficult to deal with the coordinate explicitly. Takano et al.[15] proposed a geometrical algorithm to find the projection for nonparametric e-mixture distribution, where the model subspace is spanned by several empirical distributions. The algorithm that is derived based on the generalized Pythagorean theorem only depends on the values of divergences.
At Stability's Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks?
Giladi, Niv, Nacson, Mor Shpigel, Hoffer, Elad, Soudry, Daniel
Background: Recent developments have made it possible to accelerate neural networks training significantly using large batch sizes and data parallelism. Training in an asynchronous fashion, where delay occurs, can make training even more scalable. However, asynchronous training has its pitfalls, mainly a degradation in generalization, even after convergence of the algorithm. This gap remains not well understood, as theoretical analysis so far mainly focused on the convergence rate of asynchronous methods. Contributions: We examine asynchronous training from the perspective of dynamical stability. We find that the degree of delay interacts with the learning rate, to change the set of minima accessible by an asynchronous stochastic gradient descent algorithm. We derive closed-form rules on how the learning rate could be changed, while keeping the accessible set the same. Specifically, for high delay values, we find that the learning rate should be kept inversely proportional to the delay. We then extend this analysis to include momentum. We find momentum should be either turned off, or modified to improve training stability. We provide empirical experiments to validate our theoretical findings.
Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks
Despite the extensive empirical success of deep networks, their o ptimization and generalization properties are still not well understood. Recently, the neural tangent kern el (NTK) has provided the following insight into the problem. In the infinite-width limit, the NTK converges to a limit ing kernel which stays constant during training; on the other hand, when the width is large enough, t he function learned by gradient descent follows the NTK (Jacot et al., 2018). This motivates the study of ov erparameterized networks trained by gradient descent, using properties of the NTK. In fact, paramet ers related to NTK, such as the minimum eigenvalue of the limiting kernel, appear to affect optimization and gen eralization (Arora et al., 2019). However, in addition to such NTK-dependent parameters, prior wo rk also requires the width to depend polynomially on n, 1 /δ or 1 /ǫ, where n denotes the size of the training set, δ denotes the failure probability, and ǫ denotes the target error. These large widths far exceed what is u sed empirically, constituting a significant gap between theory and practice.