Gradient Descent
Learning by on-line gradient descent
We study on-line gradient-descent learning in multilayer networks analytically and numerically. The training is based on randomly drawn inputs and their corresponding outputs as defined by a target rule. In the thermodynamic limit we derive deterministic differential equations for the order parameters of the problem which allow an exact calculation of the evolution of the generalization error. First we consider a single-layer perceptron with sigmoidal activation function learning a target rule defined by a network of the same architecture. For this model the generalization error decays exponentially with the number of training examples if the learning rate is sufficiently small.
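The single-layer teacher-student setting described above can be sketched in a few lines. This is only an illustration under stated assumptions: the erf activation, the 1/sqrt(N) scaling of the fields, the zero initialisation, and all names (`online_gd`, `generalization_error`, `eta`) are choices made for this sketch and are not taken from the paper.

```python
import math
import random

def g(x):
    # Sigmoidal activation; erf is an assumed choice for this sketch.
    return math.erf(x / math.sqrt(2.0))

def g_prime(x):
    # Derivative of erf(x / sqrt(2)) with respect to x.
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * x * x)

def field(weights, x):
    # Pre-activation, scaled by 1/sqrt(N) so it stays O(1) as N grows.
    return sum(w * xi for w, xi in zip(weights, x)) / math.sqrt(len(x))

def online_gd(n=20, steps=5000, eta=0.5, seed=0):
    """Train a student perceptron on examples labelled by a random teacher."""
    rng = random.Random(seed)
    teacher = [rng.gauss(0.0, 1.0) for _ in range(n)]
    student = [0.0] * n
    for _ in range(steps):
        # On-line learning: each example is drawn fresh and used once.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        h_s = field(student, x)
        err = g(h_s) - g(field(teacher, x))
        # Gradient of 0.5 * err**2 with respect to the student weights.
        scale = eta * err * g_prime(h_s) / math.sqrt(n)
        student = [w - scale * xi for w, xi in zip(student, x)]
    return teacher, student

def generalization_error(teacher, student, n_test=2000, seed=1):
    """Monte Carlo estimate of the expected squared error on new inputs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_test):
        x = [rng.gauss(0.0, 1.0) for _ in range(len(teacher))]
        total += 0.5 * (g(field(student, x)) - g(field(teacher, x))) ** 2
    return total / n_test
```

Calling `online_gd()` and comparing `generalization_error` for the trained student against an all-zero student gives a rough view of how the error shrinks with the number of examples at a small learning rate.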
Continual Learning with Scaled Gradient Projection
In neural networks, continual learning results in gradient interference among sequential tasks, leading to catastrophic forgetting of old tasks while learning new ones. This issue is addressed in recent methods by storing the important gradient spaces for old tasks and updating the model orthogonally during new tasks. However, such restrictive orthogonal gradient updates hamper the learning capability of the new tasks resulting in sub-optimal performance. To improve new learning while minimizing forgetting, in this paper we propose a Scaled Gradient Projection (SGP) method, where we combine the orthogonal gradient projections with scaled gradient steps along the important gradient spaces for the past tasks. The degree of gradient scaling along these spaces depends on the importance of the bases spanning them. We propose an efficient method for computing and accumulating importance of these bases using the singular value decomposition of the input representations for each task. We conduct extensive experiments ranging from continual image classification to reinforcement learning tasks and report better performance with less training overhead than the state-of-the-art approaches.
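One way to picture the difference between a strictly orthogonal update and a scaled one is the toy sketch below. It assumes an orthonormal basis for the stored gradient space and a per-basis importance value in [0, 1]; the function name `scaled_projection` and the exact update rule are illustrative assumptions, not the paper's implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scaled_projection(grad, bases, importances):
    """Remove a scaled share of the gradient along each stored basis vector.

    bases: orthonormal vectors spanning the important space of past tasks.
    importances: one value in [0, 1] per basis vector (assumed given).
    An importance of 1 removes the component completely, which is the purely
    orthogonal update; smaller values leave a scaled step along that direction.
    """
    out = list(grad)
    for u, lam in zip(bases, importances):
        coeff = lam * dot(u, grad)
        out = [o - coeff * ui for o, ui in zip(out, u)]
    return out

# Usage: the first direction is fully protected, the second only half.
new_grad = scaled_projection(
    [1.0, 2.0, 3.0],
    bases=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    importances=[1.0, 0.5],
)
```

In this example the component along the first basis vector is removed entirely, the component along the second is halved, and the third coordinate is untouched.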
HOAX: A Hyperparameter Optimization Algorithm Explorer for Neural Networks
Thie, Albert, Menger, Maximilian F. S. J., Faraji, Shirin
Computational chemistry has become an important tool to predict and understand molecular properties and reactions. Even though recent years have seen significant growth in new algorithms and computational methods that speed up quantum chemical calculations, the bottleneck for trajectory-based methods to study photoinduced processes is still the huge number of electronic structure calculations. In this work, we present an innovative solution, in which the number of electronic structure calculations is drastically reduced by employing machine learning algorithms and methods borrowed from the realm of artificial intelligence. However, applying these algorithms effectively requires finding optimal hyperparameters, which remains a challenge in itself. Here we present an automated, user-friendly framework, HOAX, to perform the hyperparameter optimization for neural networks, which bypasses the need for a lengthy manual process. The neural-network-generated potential energy surfaces (PESs) reduce the computational costs compared to the ab initio-based PESs. We perform a comparative investigation of the performance of different hyperparameter optimization algorithms, namely grid search, simulated annealing, genetic algorithm, and Bayesian optimizer, in finding the optimal hyperparameters necessary for constructing a well-performing neural network to fit the PESs of small organic molecules. Our results show that this automated toolkit not only facilitates a straightforward way to perform the hyperparameter optimization, but also that the resulting neural-network-generated PESs are in reasonable agreement with the ab initio-based PESs.
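Of the search strategies named above, grid search is the simplest to write down. The sketch below is generic and does not use or imitate the HOAX interface; `grid_search`, `toy_validation_loss`, and the hyperparameter names are hypothetical, and the objective is a stand-in for training a network and measuring its validation error.

```python
import itertools
import math

def grid_search(objective, grid):
    """Evaluate every combination in the grid and keep the lowest score."""
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_loss(params):
    # Hypothetical objective: pretends the best settings are a learning
    # rate of 1e-3 and 64 hidden units.
    lr_term = (math.log10(params["learning_rate"]) + 3.0) ** 2
    size_term = (params["hidden_units"] - 64) ** 2 / 1000.0
    return lr_term + size_term

best_params, best_score = grid_search(
    toy_validation_loss,
    {"learning_rate": [1e-4, 1e-3, 1e-2], "hidden_units": [32, 64, 128]},
)
```

The cost grows with the product of the grid sizes, which is one reason to compare it against samplers such as simulated annealing, genetic algorithms, or Bayesian optimization.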
Learning Globally Smooth Functions on Manifolds
Cervino, Juan, Chamon, Luiz F. O., Haeffele, Benjamin D., Vidal, Rene, Ribeiro, Alejandro
Smoothness and low dimensional structures play central roles in improving generalization and stability in learning and statistics. This work combines techniques from semi-infinite constrained learning and manifold regularization to learn representations that are globally smooth on a manifold. To do so, it shows that under typical conditions the problem of learning a Lipschitz continuous function on a manifold is equivalent to a dynamically weighted manifold regularization problem. This observation leads to a practical algorithm based on a weighted Laplacian penalty whose weights are adapted using stochastic gradient techniques. It is shown that under mild conditions, this method estimates the Lipschitz constant of the solution, learning a globally smooth solution as a byproduct. Experiments on real world data illustrate the advantages of the proposed method relative to existing alternatives.
An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models
Ho, Nhat, Ren, Tongzheng, Sanghavi, Sujay, Sarkar, Purnamrita, Ward, Rachel
Using gradient descent (GD) with fixed or decaying step-size is a standard practice in unconstrained optimization problems. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down as it cannot explore the flat curvature of the loss function. To overcome that issue, we propose to exponentially increase the step-size of the GD algorithm. Under homogeneous assumptions on the loss function, we demonstrate that the iterates of the proposed exponential step size gradient descent (EGD) algorithm converge linearly to the optimal solution. Leveraging that optimization insight, we then consider using the EGD algorithm for solving parameter estimation under both regular and non-regular statistical models whose loss function becomes locally convex when the sample size goes to infinity. We demonstrate that the EGD iterates reach the final statistical radius within the true parameter after a logarithmic number of iterations, which is in stark contrast to a polynomial number of iterations of the GD algorithm in non-regular statistical models. Therefore, the total computational complexity of the EGD algorithm is optimal and exponentially cheaper than that of the GD for solving parameter estimation in non-regular statistical models while being comparable to that of the GD in regular statistical settings. To the best of our knowledge, it resolves a long-standing gap between statistical and algorithmic computational complexities of parameter estimation in non-regular statistical models. Finally, we provide targeted applications of the general theory to several classes of statistical models, including generalized linear models with polynomial link functions and location Gaussian mixture models.
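A toy comparison can make the effect of a growing step size concrete. The sketch below uses the homogeneous loss f(x) = x^4, whose curvature vanishes at the optimum, and a schedule of the form eta_t = eta0 * growth^t. The loss, the constants, and the function names are assumptions chosen for illustration; the paper's conditions and schedule may differ.

```python
def grad_quartic(x):
    # Gradient of the toy loss f(x) = x**4, which is flat around its minimum.
    return 4.0 * x ** 3

def fixed_step_gd(x0=1.0, eta=0.01, steps=100):
    x = x0
    for _ in range(steps):
        x -= eta * grad_quartic(x)
    return x

def exponential_step_gd(x0=1.0, eta0=0.01, growth=1.2, steps=100):
    x, eta = x0, eta0
    for _ in range(steps):
        x -= eta * grad_quartic(x)
        eta *= growth  # step size grows geometrically with the iteration count
    return x

x_fixed = fixed_step_gd()
x_growing = exponential_step_gd()
```

With these settings the fixed-step iterate is still far from zero after 100 steps, while the growing-step iterate is several orders of magnitude closer. A growth factor that is too large would overshoot, so the values here should not be read as recommendations.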
QLAB: Quadratic Loss Approximation-Based Optimal Learning Rate for Deep Learning
We propose a learning rate adaptation scheme, called QLAB, for descent optimizers. We derive QLAB by optimizing the quadratic approximation of the loss function, and QLAB can be combined with any optimizer that can provide the descent update direction. The computation of an adaptive learning rate with QLAB requires only computing an extra loss function value. We theoretically prove the convergence of the descent optimizers with QLAB. We demonstrate the effectiveness of QLAB in a range of optimization problems by combining it with stochastic gradient descent, stochastic gradient descent with momentum, and Adam. The performance is validated on multi-layer neural networks, CNN, VGG-Net, ResNet and ShuffleNet with two datasets, MNIST and CIFAR10.
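The idea of choosing a learning rate from a quadratic model that needs one extra loss evaluation can be sketched with textbook quadratic interpolation along the update direction. This is a generic construction under stated assumptions, not the QLAB formula itself; `quadratic_fit_step`, the trial step, and the fallback rule are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_fit_step(loss, theta, loss0, grad, direction, trial=1.0):
    """Pick the step size that minimises a 1-D quadratic model of the loss.

    Model: phi(a) = loss(theta - a * direction) ~ loss0 - a * slope + curv * a**2,
    where slope = grad . direction is known from the current gradient and
    curv is fitted from a single extra loss value at a = trial.
    """
    slope = dot(grad, direction)
    trial_theta = [t - trial * d for t, d in zip(theta, direction)]
    loss_trial = loss(trial_theta)  # the one additional loss evaluation
    curv = (loss_trial - loss0 + trial * slope) / trial ** 2
    if curv <= 0.0:
        # Model has no minimum along this direction; keep the trial step
        # (an assumed fallback for this sketch).
        return trial
    return slope / (2.0 * curv)

# Usage on a toy quadratic loss with curvatures 1 and 10.
def toy_loss(theta):
    return 0.5 * (1.0 * theta[0] ** 2 + 10.0 * theta[1] ** 2)

theta = [1.0, 1.0]
grad = [1.0, 10.0]  # gradient of toy_loss at theta
alpha = quadratic_fit_step(toy_loss, theta, toy_loss(theta), grad, grad)
```

On an exactly quadratic loss the fitted step coincides with the exact line-search step along the chosen direction; on a real network it is only as good as the local quadratic approximation.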
Surprising Instabilities in Training Deep Networks and a Theoretical Analysis
Sun, Yuxin, Lao, Dong, Sundaramoorthi, Ganesh, Yezzi, Anthony
We discover restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD). We show numerical error (on the order of the smallest floating point bit) induced from floating point arithmetic in training deep nets can be amplified significantly and result in significant test accuracy variance, comparable to the test accuracy variance due to stochasticity in SGD. We show how this is likely traced to instabilities of the optimization dynamics that are restrained, i.e., localized over iterations and regions of the weight tensor space. We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs). We show that it is stable only under certain conditions on the learning rate and weight decay. We show that rather than blowing up when the conditions are violated, the instability can be restrained. We show this is a consequence of the non-linear PDE associated with the gradient descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect. We link restrained instabilities to the recently discovered Edge of Stability (EoS) phenomena, in which the stable step size predicted by classical theory is exceeded while continuing to optimize the loss and still converging. Because restrained instabilities occur at the EoS, our theory provides new predictions about the EoS, in particular, the role of regularization and the dependence on the network complexity.
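For orientation, the classical stability condition that the Edge of Stability discussion refers to can be checked on a one-dimensional quadratic: gradient descent on 0.5 * (curvature + weight_decay) * w^2 contracts only when lr * (curvature + weight_decay) < 2. The sketch below illustrates that textbook linear case only; it does not reproduce the paper's PDE analysis or the restrained behaviour of non-linear networks, and `run_gd` is a hypothetical helper.

```python
def run_gd(curvature, lr, weight_decay=0.0, steps=50, w0=1.0):
    """Gradient descent on 0.5 * curvature * w**2 plus an L2 weight-decay term."""
    w = w0
    for _ in range(steps):
        grad = curvature * w + weight_decay * w
        w -= lr * grad
    return w

stable = run_gd(curvature=10.0, lr=0.19)    # lr * 10 = 1.9 < 2: shrinks
unstable = run_gd(curvature=10.0, lr=0.21)  # lr * 10 = 2.1 > 2: grows
# Weight decay adds to the effective curvature and can tip a stable
# learning rate over the threshold: 0.19 * (10 + 1) = 2.09 > 2.
decayed = run_gd(curvature=10.0, lr=0.19, weight_decay=1.0)
```

In this linear model, violating the condition always leads to unbounded growth, which is the behaviour the abstract contrasts with the restrained instabilities seen in practice.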
Training trajectories, mini-batch losses and the curious role of the learning rate
Sandler, Mark, Zhmoginov, Andrey, Vladymyrov, Max, Miller, Nolan
Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning. However, its ability to converge to a global minimum remains shrouded in mystery. In this paper we propose to study the behavior of the loss function on fixed mini-batches along SGD trajectories. We show that the loss function on a fixed batch appears to be remarkably convex-like. In particular, for ResNet the loss for any fixed mini-batch can be accurately modeled by a quadratic function, and a very low loss value can be reached in just one step of gradient descent with a sufficiently large learning rate. We propose a simple model that allows us to analyze the relationship between the gradients of stochastic mini-batches and the full batch. Our analysis allows us to discover the equivalency between iterate aggregates and specific learning rate schedules. In particular, for Exponential Moving Average (EMA) and Stochastic Weight Averaging we show that our proposed model matches the observed training trajectories on ImageNet. Our theoretical model predicts that an even simpler averaging technique, averaging just two points many steps apart, significantly improves accuracy compared to the baseline. We validated our findings on ImageNet and other datasets using the ResNet architecture.
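The iterate aggregates mentioned above are simple to state in code. The sketch below shows an exponential moving average over a saved trajectory and the two-point average of the last iterate with one taken `gap` steps earlier. The function names, the decay value, and the representation of weights as plain lists are assumptions for illustration only.

```python
def ema(iterates, decay=0.9):
    """Exponential moving average of a sequence of weight vectors."""
    avg = list(iterates[0])
    for w in iterates[1:]:
        avg = [decay * a + (1.0 - decay) * x for a, x in zip(avg, w)]
    return avg

def two_point_average(iterates, gap):
    """Average the final iterate with the one `gap` steps before it."""
    if not 0 < gap < len(iterates):
        raise ValueError("gap must be between 1 and len(iterates) - 1")
    last, earlier = iterates[-1], iterates[-1 - gap]
    return [0.5 * (a + b) for a, b in zip(last, earlier)]

# Usage on a tiny one-dimensional trajectory.
trajectory = [[0.0], [1.0], [2.0]]
smoothed = ema(trajectory, decay=0.5)
paired = two_point_average(trajectory, gap=2)
```

Both functions only post-process stored weights, so they can be applied to any training run without changing the optimizer.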
Implicit regularization in Heavy-ball momentum accelerated stochastic gradient descent
Ghosh, Avrajit, Lyu, He, Zhang, Xitong, Wang, Rongrong
It is well known that the finite step-size ($h$) in Gradient Descent (GD) implicitly regularizes solutions to flatter minima. A natural question to ask is "Does the momentum parameter $\beta$ play a role in implicit regularization in Heavy-ball (H.B) momentum accelerated gradient descent (GD+M)?". To answer this question, first, we show that the discrete H.B momentum update (GD+M) follows a continuous trajectory induced by a modified loss, which consists of an original loss and an implicit regularizer. Then, we show that this implicit regularizer for (GD+M) is stronger than that of (GD) by a factor of $(\frac{1+\beta}{1-\beta})$, thus explaining why (GD+M) shows better generalization performance and higher test accuracy than (GD). Furthermore, we extend our analysis to the stochastic version of gradient descent with momentum (SGD+M) and characterize the continuous trajectory of the update of (SGD+M) in a pointwise sense. We explore the implicit regularization in (SGD+M) and (GD+M) through a series of experiments validating our theory.
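To make the quantities concrete, the sketch below writes out one common form of the heavy-ball update and evaluates the factor $(1+\beta)/(1-\beta)$ quoted above. The velocity-based parameterisation, the toy quadratic, and the names `heavy_ball` and `regularization_factor` are assumptions for this sketch; the paper's exact update convention may differ.

```python
def regularization_factor(beta):
    """Ratio (1 + beta) / (1 - beta) from the abstract, for 0 <= beta < 1."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return (1.0 + beta) / (1.0 - beta)

def heavy_ball(grad_fn, x0, h=0.1, beta=0.9, steps=500):
    """Heavy-ball momentum: v <- beta * v - h * grad(x); x <- x + v."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - h * grad_fn(x)
        x = x + v
    return x

# Usage: minimise 0.5 * x**2, whose gradient is x.
x_final = heavy_ball(lambda x: x, x0=1.0)
factor = regularization_factor(0.9)  # about 19 for beta = 0.9
```

Setting `beta=0.0` recovers plain gradient descent with step size `h` and a factor of 1, which is the baseline the comparison is made against.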
Adapting Step-size: A Unified Perspective to Analyze and Improve Gradient-based Methods for Adversarial Attacks
Tao, Wei, Bao, Lei, Long, Sheng, Wu, Gaowei, Tao, Qing
Learning adversarial examples can be formulated as an optimization problem of maximizing the loss function under box constraints. However, for solving this induced optimization problem, state-of-the-art gradient-based methods such as FGSM, I-FGSM and MI-FGSM look different from their original methods, especially in how the update direction is computed, which makes them difficult to understand and leaves some theoretical issues to be addressed from the viewpoint of optimization. In this paper, from the perspective of adapting step-size, we provide a unified theoretical interpretation of these gradient-based adversarial learning methods. We show that each of these algorithms is in fact a specific reformulation of its original gradient method, but using step-size rules with only current gradient information. Motivated by this analysis, we present a broad class of adaptive gradient-based algorithms based on the regular gradient methods, in which a step-size strategy utilizing information from the accumulated gradients is integrated. Such adaptive step-size strategies directly normalize the scale of the gradients rather than use some empirical operations. The important benefit is that convergence of the iterative algorithms is guaranteed, and the whole optimization process can then be stabilized. The experiments demonstrate that our AdaI-FGM consistently outperforms I-FGSM and AdaMI-FGM remains competitive with MI-FGSM for black-box attacks.
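For reference, the sign-gradient baseline that this line of work reinterprets can be sketched as follows: each iteration takes a fixed-size step along the sign of the loss gradient and then clips the result back into a box of radius `eps` around the original input. The adaptive variants proposed in the paper are not implemented here; `i_fgsm`, `toy_grad`, and the toy loss are hypothetical.

```python
def sign(v):
    return (v > 0) - (v < 0)

def i_fgsm(x0, grad_fn, eps, step, iters):
    """Iterative sign-gradient ascent on the loss inside an L-infinity box.

    With iters=1 and step=eps this reduces to the single-step FGSM update.
    """
    x = list(x0)
    for _ in range(iters):
        g = grad_fn(x)
        # Ascent step: only the sign of each gradient coordinate is used.
        x = [xi + step * sign(gi) for xi, gi in zip(x, g)]
        # Box constraint: stay within eps of the original input.
        x = [min(max(xi, oi - eps), oi + eps) for xi, oi in zip(x, x0)]
    return x

# Usage with a toy loss sum((x_i - t_i)**2), whose gradient is 2 * (x - t).
target = [0.0, 1.0]

def toy_grad(x):
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]

x_adv = i_fgsm([0.5, 0.5], toy_grad, eps=0.1, step=0.05, iters=5)
```

Because the sign discards the gradient magnitude, every coordinate moves by the same amount per step, which is the step-size behaviour the abstract analyses.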