Gradient Descent
Faster SGD training by minibatch persistency
Fischetti, Matteo, Mandatelli, Iacopo, Salvagnin, Domenico
It is well known that, for most datasets, the use of large-size minibatches for Stochastic Gradient Descent (SGD) typically leads to slow convergence and poor generalization. On the other hand, large minibatches are of great practical interest as they allow for a better exploitation of modern GPUs. Previous literature on the subject concentrated on how to adjust the main SGD parameters (in particular, the learning rate) when using large minibatches. In this work we introduce an additional feature, that we call minibatch persistency, that consists in reusing the same minibatch for K consecutive SGD iterations. The computational conjecture here is that a large minibatch contains a significant sample of the training set, so one can afford to slightly overfitting it without worsening generalization too much. The approach is intended to speedup SGD convergence, and also has the advantage of reducing the overhead related to data loading on the internal GPU memory. We present computational results on CIFAR-10 with an AlexNet architecture, showing that even small persistency values (K=2 or 5) already lead to a significantly faster convergence and to a comparable (or even better) generalization than the standard "disposable minibatch" approach (K=1), in particular when large minibatches are used. The lesson learned is that minibatch persistency can be a simple yet effective way to deal with large minibatches.
Inference in Deep Gaussian Processes using Stochastic Gradient Hamiltonian Monte Carlo
Havasi, Marton, Hernรกndez-Lobato, Josรฉ Miguel, Murillo-Fuentes, Juan Josรฉ
Deep Gaussian Processes (DGPs) are hierarchical generalizations of Gaussian Processes that combine well calibrated uncertainty estimates with the high flexibility of multilayer models. One of the biggest challenges with these models is that exact inference is intractable. The current state-of-the-art inference method, Variational Inference (VI), employs a Gaussian approximation to the posterior distribution. This can be a potentially poor unimodal approximation of the generally multimodal posterior. In this work, we provide evidence for the non-Gaussian nature of the posterior and we apply the Stochastic Gradient Hamiltonian Monte Carlo method to directly sample from it. To efficiently optimize the hyperparameters, we introduce the Moving Window MCEM algorithm. This results in significantly better predictions at a lower computational cost than its VI counterpart. Thus our method establishes a new state-of-the-art for inference in DGPs.
Laplacian Smoothing Gradient Descent
Osher, Stanley, Wang, Bao, Yin, Penghang, Luo, Xiyang, Pham, Minh, Lin, Alex
We propose a very simple modification of gradient descent and stochastic gradient descent. We show that when applied to a variety of machine learning models including softmax regression, convolutional neural nets, generative adversarial nets, and deep reinforcement learning, this very simple surrogate can dramatically reduce the variance and improve the accuracy of the generalization. The new algorithm, (which depends on one nonnegative parameter) when applied to non-convex minimization, tends to avoid sharp local minima. Instead it seeks somewhat flatter local (and often global) minima. The method only involves preconditioning the gradient by the inverse of a tri-diagonal matrix that is positive definite. The motivation comes from the theory of Hamilton-Jacobi partial differential equations. This theory demonstrates that the new algorithm is almost the same as doing gradient descent on a new function which (a) has the same global minima as the original function and (b) is "more convex". Again, the programming effort in doing this is minimal, in cost, complexity and effort. We implement our algorithm into both PyTorch and Tensorflow platforms, which will be made publicly available.
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors
Nitanda, Atsushi, Suzuki, Taiji
We consider stochastic gradient descent for binary classification problems in a reproducing kernel Hilbert space. In traditional analysis, it is known that the expected classification error converges more slowly than the expected risk even when assuming a low-noise condition on the conditional label probabilities. Consequently, the resulting rate is sublinear. Therefore, it is important to consider whether much faster convergence of the expected classification error can be achieved. In recent research, an exponential convergence rate for stochastic gradient descent was shown under a strong low-noise condition, but theoretical analysis of this was limited to the square loss function, which is somewhat inadequate for binary classification tasks. In this paper, we show an exponential convergence rate of the expected classification error in the final phase of learning for a wide class of differentiable convex loss functions under similar assumptions.
PAC-Bayes Control: Synthesizing Controllers that Provably Generalize to Novel Environments
Majumdar, Anirudha, Goldstein, Maxwell
Our goal is to synthesize controllers for robots that provably generalize well to novel environments given a dataset of example environments. The key technical idea behind our approach is to leverage tools from generalization theory in machine learning by exploiting a precise analogy (which we present in the form of a reduction) between robustness of controllers to novel environments and generalization of hypotheses in supervised learning. In particular, we utilize the Probably Approximately Correct (PAC)-Bayes framework, which allows us to obtain upper bounds (that hold with high probability) on the expected cost of (stochastic) controllers across novel environments. We propose control synthesis algorithms that explicitly seek to minimize this upper bound. The corresponding optimization problem can be solved using convex optimization (Relative Entropy Programming in particular) in the setting where we are optimizing over a finite control policy space. In the more general setting of continuously parameterized controllers, we minimize this upper bound using stochastic gradient descent. We present examples of our approach in the context of obstacle avoidance control with depth measurements. Our simulated examples demonstrate the potential of our approach to provide strong generalization guarantees on controllers for robotic systems with continuous state and action spaces, complicated (e.g., nonlinear) dynamics, and rich sensory inputs (e.g., depth measurements).
Defending Against Saddle Point Attack in Byzantine-Robust Distributed Learning
Yin, Dong, Chen, Yudong, Ramchandran, Kannan, Bartlett, Peter
In this paper, we study robust large-scale distributed learning in the presence of saddle points in non-convex loss functions. We consider the Byzantine setting where some worker machines may have abnormal or even arbitrary and adversarial behavior. We argue that in the Byzantine setting, optimizing a non-convex function and escaping saddle points become much more challenging, even when robust gradient estimators are used. We develop ByzantinePGD, a robust and communication-efficient algorithm that can provably escape saddle points and converge to approximate local minimizers. The iteration complexity of our algorithm in the Byzantine setting matches that of standard gradient descent in the usual setting. We further provide three robust aggregation subroutines that can be used in ByzantinePGD, including median, trimmed mean, and iterative filtering. We characterize their performance in statistical settings, and argue for their near-optimality in different regimes including the high dimensional setting.
INFERNO: Inference-Aware Neural Optimisation
de Castro, Pablo, Dorigo, Tommaso
Complex computer simulations are commonly required for accurate data modelling in many scientific disciplines, making statistical inference challenging due to the intractability of the likelihood evaluation for the observed data. Furthermore, sometimes one is interested on inference drawn over a subset of the generative model parameters while taking into account model uncertainty or misspecification on the remaining nuisance parameters. In this work, we show how non-linear summary statistics can be constructed by minimising inference-motivated losses via stochastic gradient descent.
Convergence of SGD in Learning ReLU Models with Separable Data
Xu, Tengyu, Zhou, Yi, Ji, Kaiyi, Liang, Yingbin
We consider the binary classification problem in which the objective function is the exponential loss with a ReLU model, and study the convergence property of the stochastic gradient descent (SGD) algorithm on linearly separable data. We show that the gradient descent (GD) algorithm do not always learn desirable model parameters due to the nonlinear ReLU model. Then, we identify a certain condition of data samples, under which we show that SGD can learn a proper classifier with implicit bias. In specific, we establish the sub-linear convergence rate of the function value generated by SGD to global minimum. We further show that SGD actually converges in expectation to the maximum margin classifier with respect to the samples with +1 label under the ReLU model at the rate O(1/ln t). We also extend our study to the case of multi-ReLU neurons, and show that SGD converges to a certain non-linear maximum margin classifier for a class of non-linearly separable data.
Meta-Learning for Stochastic Gradient MCMC
Gong, Wenbo, Li, Yingzhen, Hernรกndez-Lobato, Josรฉ Miguel
Stochastic gradient Markov chain Monte Carlo (SG-MCMC) has become increasingly popular for simulating posterior samples in large-scale Bayesian modeling. However, existing SG-MCMC schemes are not tailored to any specific probabilistic model, even a simple modification of the underlying dynamical system requires significant physical intuition. This paper presents the first meta-learning algorithm that allows automated design for the underlying continuous dynamics of an SG-MCMC sampler. The learned sampler generalizes Hamiltonian dynamics with state-dependent drift and diffusion, enabling fast traversal and efficient exploration of neural network energy landscapes. Experiments validate the proposed approach on both Bayesian fully connected neural network and Bayesian recurrent neural network tasks, showing that the learned sampler out-performs generic, hand-designed SG-MCMC algorithms, and generalizes to different datasets and larger architectures.
Auto-Meta: Automated Gradient Based Meta Learner Search
Kim, Jaehong, Choi, Youngduck, Cha, Moonsu, Lee, Jung Kwon, Lee, Sangyeul, Kim, Sungwan, Choi, Yongseok, Kim, Jiwon
Fully automating machine learning pipeline is one of the outstanding challenges of general artificial intelligence, as practical machine learning often requires costly human driven process, such as hyper-parameter tuning, algorithmic selection, and model selection. In this work, we consider the problem of executing automated, yet scalable search for finding optimal gradient based meta-learners in practice. As a solution, we apply progressive neural architecture search to proto-architectures by appealing to the model agnostic nature of general gradient based meta learners. In the presence of recent universality result of Finn \textit{et al.}\cite{finn:universality_maml:DBLP:/journals/corr/abs-1710-11622}, our search is a priori motivated in that neural network architecture search dynamics---automated or not---may be quite different from that of the classical setting with the same target tasks, due to the presence of the gradient update operator. A posteriori, our search algorithm, given appropriately designed search spaces, finds gradient based meta learners with non-intuitive proto-architectures that are narrowly deep, unlike the inception-like structures previously observed in the resulting architectures of traditional NAS algorithms. Along with these notable findings, the searched gradient based meta-learner achieves state-of-the-art results on the few shot classification problem on Mini-ImageNet with $76.29\%$ accuracy, which is an $13.18\%$ improvement over results reported in the original MAML paper. To our best knowledge, this work is the first successful AutoML implementation in the context of meta learning.