Gradient Descent
Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers
Hanna, Serge Kas, Bitar, Rawad, Parag, Parimal, Dasari, Venkat, Rouayheb, Salim El
We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers each having a subset of the data. Distributed SGD may suffer from the effect of stragglers, i.e., slow or unresponsive workers who cause delays. One solution studied in the literature is to wait at each iteration for the responses of the fastest $k
Towards an Efficient and General Framework of Robust Training for Graph Neural Networks
Xu, Kaidi, Liu, Sijia, Chen, Pin-Yu, Sun, Mengshu, Ding, Caiwen, Kailkhura, Bhavya, Lin, Xue
Graph Neural Networks (GNNs) have made significant advances on several fundamental inference tasks. As a result, there is a surge of interest in using these models for making potentially important decisions in high-regret applications. However, despite GNNs' impressive performance, it has been observed that carefully crafted perturbations on graph structures (or nodes attributes) lead them to make wrong predictions. Presence of these adversarial examples raises serious security concerns. Most of the existing robust GNN design/training methods are only applicable to white-box settings where model parameters are known and gradient based methods can be used by performing convex relaxation of the discrete graph domain. More importantly, these methods are not efficient and scalable which make them infeasible in time sensitive tasks and massive graph datasets. To overcome these limitations, we propose a general framework which leverages the greedy search algorithms and zeroth-order methods to obtain robust GNNs in a generic and an efficient manner. On several applications, we show that the proposed techniques are significantly less computationally expensive and, in some cases, more robust than the state-of-the-art methods making them suitable to large-scale problems which were out of the reach of traditional robust training methods.
General Framework for Binary Classification on Top Samples
Adam, Lukรกลก, Mรกcha, Vรกclav, ล mรญdl, Vรกclav, Pevnรฝ, Tomรกลก
Many binary classification problems minimize misclassification above (or below) a threshold. We show that instances of ranking problems, accuracy at the top or hypothesis testing may be written in this form. We propose a general framework to handle these classes of problems and show which known methods (both known and newly proposed) fall into this framework. We provide a theoretical analysis of this framework and mention selected possible pitfalls the methods may encounter. We suggest several numerical improvements including the implicit derivative and stochastic gradient descent. We provide an extensive numerical study. Based both on the theoretical properties and numerical experiments, we conclude the paper by suggesting which method should be used in which situation.
Biased Stochastic Gradient Descent for Conditional Stochastic Optimization
Hu, Yifan, Zhang, Siqi, Chen, Xin, He, Niao
Conditional Stochastic Optimization (CSO) covers a variety of applications ranging from meta-learning and causal inference to invariant learning. However, constructing unbiased gradient estimates in CSO is challenging due to the composition structure. As an alternative, we propose a biased stochastic gradient descent (BSGD) algorithm and study the bias-variance tradeoff under different structural assumptions. We establish the sample complexities of BSGD for strongly convex, convex, and weakly convex objectives, under smooth and non-smooth conditions. We also provide matching lower bounds of BSGD for convex CSO objectives. Extensive numerical experiments are conducted to illustrate the performance of BSGD on robust logistic regression, model-agnostic meta-learning (MAML), and instrumental variable regression (IV).
The Early Phase of Neural Network Training
Frankle, Jonathan, Schwab, David J., Morcos, Ari S.
Recent studies have shown that many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example, sparse, trainable sub-networks emerge (Frankle et al., 2019), gradient descent moves into a small subspace (Gur-Ari et al., 2018), and the network undergoes a critical period (Achille et al., 2019). Here, we examine the changes that deep neural networks undergo during this early phase of training. We perform extensive measurements of the network state during these early iterations of training and leverage the framework of Frankle et al. (2019) to quantitatively probe the weight distribution and its reliance on various aspects of the dataset. We find that, within this framework, deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations. Despite this behavior, pre-training with blurred inputs or an auxiliary self-supervised task can approximate the changes in supervised networks, suggesting that these changes are not inherently label-dependent, though labels significantly accelerate this process. Together, these results help to elucidate the network changes occurring during this pivotal initial period of learning.
Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization
An open question in the Deep Learning community is why neural networks trained with Gradient Descent generalize well on real datasets even though they are capable of fitting random data. We propose an approach to answering this question based on a hypothesis about the dynamics of gradient descent that we call Coherent Gradients: Gradients from similar examples are similar and so the overall gradient is stronger in certain directions where these reinforce each other. Thus changes to the network parameters during training are biased towards those that (locally) simultaneously benefit many examples when such similarity exists. We support this hypothesis with heuristic arguments and perturbative experiments and outline how this can explain several common empirical observations about Deep Learning. Furthermore, our analysis is not just descriptive, but prescriptive. It suggests a natural modification to gradient descent that can greatly reduce overfitting.
Statistical Adaptive Stochastic Gradient Methods
Zhang, Pengchuan, Lang, Hunter, Liu, Qiang, Xiao, Lin
We propose a statistical adaptive procedure called SALSA for automatically scheduling the learning rate (step size) in stochastic gradient methods. SALSA first uses a smoothed stochastic line-search procedure to gradually increase the learning rate, then automatically switches to a statistical method to decrease the learning rate. The line search procedure ``warms up'' the optimization process, reducing the need for expensive trial and error in setting an initial learning rate. The method for decreasing the learning rate is based on a new statistical test for detecting stationarity when using a constant step size. Unlike in prior work, our test applies to a broad class of stochastic gradient algorithms without modification. The combined method is highly robust and autonomous, and it matches the performance of the best hand-tuned learning rate schedules in our experiments on several deep learning tasks.
Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent
Wang, Bao, Nguyen, Tan M., Bertozzi, Andrea L., Baraniuk, Richard G., Osher, Stanley J.
Stochastic gradient descent (SGD) with constant momentum and its variants such as Adam are the optimization algorithms of choice for training deep neural networks (DNNs). Since DNN training is incredibly computationally expensive, there is great interest in speeding up convergence. Nesterov accelerated gradient (NAG) improves the convergence rate of gradient descent (GD) for convex optimization using a specially designed momentum; however, it accumulates error when an inexact gradient is used (such as in SGD), slowing convergence at best and diverging at worst. In this paper, we propose Scheduled Restart SGD (SRSGD), a new NAG-style scheme for training DNNs. SRSGD replaces the constant momentum in SGD by the increasing momentum in NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule. Using a variety of models and benchmarks for image classification, we demonstrate that, in training DNNs, SRSGD significantly improves convergence and generalization; for instance in training ResNet200 for ImageNet classification, SRSGD achieves an error rate of 20.93% vs. the benchmark of 22.13%. These improvements become more significant as the network grows deeper. Furthermore, on both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with fewer training epochs compared to the SGD baseline. We provide code for SRSGD at https://github.com/minhtannguyen/SRSGD.
Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence
Loizou, Nicolas, Vaswani, Sharan, Laradji, Issam, Lacoste-Julien, Simon
We propose a stochastic variant of the classical Polyak step-size (Polyak, 1987) commonly used in the subgradient method. Although computing the Polyak step-size requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak step-size (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD equipped with SPS in different settings, including strongly convex, convex and non-convex functions. Furthermore, our analysis results in novel convergence guarantees for SGD with a constant step-size. We show that SPS is particularly effective when training over-parameterized models capable of interpolating the training data. In this setting, we prove that SPS enables SGD to converge to the true solution at a fast rate without requiring the knowledge of any problem-dependent constants or additional computational overhead. We experimentally validate our theoretical results via extensive experiments on synthetic and real datasets. We demonstrate the strong performance of SGD with SPS compared to state-of-the-art optimization methods when training over-parameterized models.
Interpolating Between Gradient Descent and Exponentiated Gradient Using Reparameterized Gradient Descent
Amid, Ehsan, Warmuth, Manfred K.
Continuous-time mirror descent (CMD) can be seen as the limit case of the discrete-time MD update when the step-size is infinitesimally small. In this paper, we focus on the geometry of the primal and dual CMD updates and introduce a general framework for reparameterizing one CMD update as another. Specifically, the reparameterized update also corresponds to a CMD, but on the composite loss w.r.t. the new variables, and the original variables are obtained via the reparameterization map. We employ these results to introduce a new family of reparameterizations that interpolate between the two commonly used updates, namely the continuous-time gradient descent (GD) and unnormalized exponentiated gradient (EGU), while extending to many other well-known updates. In particular, we show that for the underdetermined linear regression problem, these updates generalize the known behavior of GD and EGU, and provably converge to the minimum $\mathrm{L}_{2-\tau}$-norm solution for $\tau\in[0,1]$. Our new results also have implications for the regularized training of neural networks to induce sparsity.