Gradient Descent
An Alternative Surrogate Loss for PGD-based Adversarial Testing
Gowal, Sven, Uesato, Jonathan, Qin, Chongli, Huang, Po-Sen, Mann, Timothy, Kohli, Pushmeet
Adversarial testing methods based on Projected Gradient Descent (PGD) are widely used for searching norm-bounded perturbations that cause the inputs of neural networks to be misclassified. This paper takes a deeper look at these methods and explains the effect of different hyperparameters (i.e., optimizer, step size and surrogate loss). We introduce the concept of MultiTargeted testing, which makes clever use of alternative surrogate losses, and explain when and how MultiTargeted is guaranteed to find optimal perturbations. Finally, we demonstrate that MultiTargeted outperforms more sophisticated methods and often requires less iterative steps than other variants of PGD found in the literature. Notably, MultiTargeted ranks first on MadryLab's white-box MNIST and CIFAR-10 leaderboards, reducing the accuracy of their MNIST model to 88.36% (with $\ell_\infty$ perturbations of $\epsilon = 0.3$) and the accuracy of their CIFAR-10 model to 44.03% (at $\epsilon = 8/255$). MultiTargeted also ranks first on the TRADES leaderboard reducing the accuracy of their CIFAR-10 model to 53.07% (with $\ell_\infty$ perturbations of $\epsilon = 0.031$).
On the Convergence of Perturbed Distributed Asynchronous Stochastic Gradient Descent to Second Order Stationary Points in Non-convex Optimization
Wang, Lifu, Shen, Bo, Zhao, Ning
In this paper, the second order convergence of non-convex optimization in the asynchronous stochastic gradient descent (ASGD) algorithm is studied systematically. We investigate the behavior of ASGD near and away from saddle points. Different from the general stochastic gradient descent(SGD), we show that ASGD might return back even if it has escaped the saddle points, yet after staying near a strict saddle point for a long enough time ($O(T)$), ASGD will finally go away from strict saddle points. An inequality is given to describe the process of ASGD to escape saddle points. Using a novel Razumikhin-Lyapunov method, we show the exponential instability of the perturbed gradient dynamics near the strict saddle points and give a more detailed estimation about how the time delay parameter $T$ influences the speed to escape. In particular, we consider the optimization of smooth nonconvex functions, and propose a perturbed asynchronous stochastic gradient descent algorithm with guarantee of convergence to second order stationary points with high probability in $O(1/\epsilon^4)$ iterations. To the best of our knowledge, this is the first work on the second order convergence of asynchronous algorithm.
Fast Exact Matrix Completion: A Unifying Optimization Framework
Bertsimas, Dimitris, Li, Michael Lingzhi
We consider the problem of matrix completion of rank $k$ on an $n\times m$ matrix. We show that both the general case and the case with side information can be formulated as a combinatorical problem of selecting $k$ vectors from $p$ column features. We demonstrate that it is equivalent to a separable optimization problem that is amenable to stochastic gradient descent. We design fastImpute, based on projected stochastic gradient descent, to enable efficient scaling of the algorithm of sizes of $10^5 \times 10^5$. We report experiments on both synthetic and real-world datasets that show fastImpute is competitive in both the accuracy of the matrix recovered and the time needed across all cases. Furthermore, when a high number of entries are missing, fastImpute is over $75\%$ lower in MAPE and $10$x faster than current state-of-the-art matrix completion methods in both the case with side information and without.
First-Order Preconditioning via Hypergradient Descent
Moskovitz, Ted, Wang, Rui, Lan, Janice, Kapoor, Sanyam, Miconi, Thomas, Yosinski, Jason, Rawal, Aditya
A BSTRACT Standard gradient descent methods are susceptible to a range of issues that can impede training, such as high correlations and different scaling in parameter space. These difficulties can be addressed by second-order approaches that apply a preconditioning matrix to the gradient to improve convergence. Unfortunately, such algorithms typically struggle to scale to high-dimensional problems, in part because the calculation of specific preconditioners such as the inverse Hessian or Fisher information matrix is highly expensive. We introduce first-order preconditioning (FOP), a fast, scalable approach that generalizes previous work on hyper-gradient descent (Almeida et al., 1998; Maclaurin et al., 2015; Baydin et al., 2017) to learn a preconditioning matrix that only makes use of first-order information. Experiments show that FOP is able to improve the performance of standard deep learning optimizers on several visual classification tasks with minimal computational overhead. We also investigate the properties of the learned preconditioning matrices and perform a preliminary theoretical analysis of the algorithm. Despite this, deep neural networks and other large-scale machine learning models applied to such problems typically rely on simple variations of gradient descent to train, which is known to be highly sensitive to these difficulties.
On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation
Kumar, Harshat, Koppel, Alec, Ribeiro, Alejandro
Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to stochastic gradient method for non-convex problems or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold for in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle, which suggests that learning more slowly may lead to improved limit points, providing insight into the interplay between optimization and generalization in reinforcement learning.
Linear Regression with Gradient Descent from Scratch in Numpy
I strongly advise you to read the article linked above. It will set the foundations on the topic, plus some math is already discussed there. To start out, I'll define my dataset -- only three points that are in a linear relationship. I've chosen so few points only because the math will be shorter -- needless to say, the math won't be more complex for longer dataset, it would just be longer, and I don't want to make some stupid arithmetic mistake. Then I'll set coefficients beta 0 and beta 1 to some constant and define the cost function as Sum of Squared Residuals (SSR/SSE).
Demystifying Different Variants of Gradient Descent Optimization Algorithm
In this post, we have looked at the batch gradient descent, the need to develop new optimization techniques, and then we briefly discussed how to interpret contour plots. After that, we have looked at behind six different optimization techniques and three different data strategies (batch, mini-batch & stochastic) with an intuitive understanding which helps to know where to use any of these algorithms. In Practice Adam optimizer with mini-batch of sizes 32, 64 and 128 is the default choice, at least for all the image classification tasks which deal with CNN and large sequence to sequence models.
Over-parameterization as a Catalyst for Better Generalization of Deep ReLU network
A BSTRACT To analyze deep ReLU network, we adopt a student-teacher setting in which an over-parameterized student network learns from the output of a fixed teacher network of the same depth, with Stochastic Gradient Descent (SGD). First, we prove that when the gradient is zero (or bounded above by a small constant) at every data point in training, a situation called interpolation setting, there exists many-to-one alignment between student and teacher nodes in the lowest layer under mild conditions. This suggests that generalization in unseen dataset is achievable, even the same condition often leads to zero training error. Second, analysis of noisy recovery and training dynamics in 2-layer network shows that strong teacher nodes (with large fan-out weights) are learned first and subtle teacher nodes are left unlearned until late stage of training. As a result, it could take a long time to converge into these small-gradient critical points. Our analysis shows that over-parameterization plays two roles: (1) it is a necessary condition for alignment to happen at the critical points, and (2) in training dynamics, it helps student nodes cover more teacher nodes with fewer iterations. Although networks with even one-hidden layer can fit any function (Hornik et al., 1989), it remains an open question how such networks can generalize to new data. Different from what traditional machine learning theory predicts, empirical evidence (Zhang et al., 2017) shows more parameters in neural network lead to better generalization. How over-parameterization yields strong generalization is an important question for understanding how deep learning works. In this paper, we analyze multi-layer ReLU networks by adopting teacher-student setting. The fixed teacher network provides the output for the student to learn via SGD. The student is over-parameterized (or over-realized): it has more nodes than the teacher. Therefore, there exists student weights whose gradient at every data point is zero. Here, we want to study the inverse problem: With small gradient at every training sample, can the student weights recover the teachers'? If so, then the generalization performance can be guaranteed if the training converges to such critical points. In this paper, we show that this so-called interpolation setting (Ma et al., 2017; Liu & Belkin, 2018; Bassily et al., 2018) leads to alignment: under certain conditions, each teacher node is provably aligned with at least one student node in the lowest layer. The condition is simply that the teacher node is observed by at least one student node, i.e., teacher's ReLU boundary lies in the activation region of that student. Therefore, more over-parameterization increases the probability of teachers being observed and thus being aligned. Furthermore, in 2-layer case, those student nodes that are not aligned with any teacher have zero contribution to the output and can be pruned.
Mirror Descent View for Neural Network Quantization
Ajanthan, Thalaiyasingam, Gupta, Kartik, Torr, Philip H. S., Hartley, Richard, Dokania, Puneet K.
Quantizing large Neural Networks (NN) while maintaining the performance is highly desirable for resource-limited devices due to reduced memory and time complexity. NN quantization is usually formulated as a constrained optimization problem and optimized via a modified version of gradient descent. In this work, by interpreting the continuous parameters (unconstrained) as the dual of the quantized ones, we introduce a Mirror Descent (MD) framework (Bubeck (2015)) for NN quantization. Specifically, we provide conditions on the projections (i.e., mapping from continuous to quantized ones) which would enable us to derive valid mirror maps and in turn the respective MD updates. Furthermore, we discuss a numerically stable implementation of MD by storing an additional set of auxiliary dual variables (continuous). This update is strikingly analogous to the popular Straight Through Estimator (STE) based method which is typically viewed as a "trick" to avoid vanishing gradients issue but here we show that it is an implementation method for MD for certain projections. Our experiments on standard classification datasets (CIFAR-10/100, TinyImageNet) with convolutional and residual architectures show that our MD variants obtain fully-quantized networks with accuracies very close to the floating-point networks.
Improving the convergence of SGD through adaptive batch sizes
Sievert, Scott, Charles, Zachary
Mini-batch stochastic gradient descent (SGD) approximates the gradient of an objective function with the average gradient of some batch of constant size. While small batch sizes can yield high-variance gradient estimates that prevent the model from learning a good model, large batches may require more data and computational effort. This work presents a method to change the batch size adaptively with model quality. We show that our method requires the same number of model updates as full-batch gradient descent while requiring the same total number of gradient computations as SGD. While this method requires evaluating the objective function, we present a passive approximation that eliminates this constraint and improves computational efficiency. We provide extensive experiments illustrating that our methods require far fewer model updates without increasing the total amount of computation.