Goto

Collaborating Authors

 Gradient Descent


Decentralised Learning with Random Features and Distributed Gradient Descent

arXiv.org Machine Learning

We investigate the generalisation performance of Distributed Gradient Descent with Implicit Regularisation and Random Features in the homogenous setting where a network of agents are given data sampled independently from the same unknown distribution. Along with reducing the memory footprint, Random Features are particularly convenient in this setting as they provide a common parameterisation across agents that allows to overcome previous difficulties in implementing Decentralised Kernel Regression. Under standard source and capacity assumptions, we establish high probability bounds on the predictive performance for each agent as a function of the step size, number of iterations, inverse spectral gap of the communication matrix and number of Random Features. By tuning these parameters, we obtain statistical rates that are minimax optimal with respect to the total number of samples in the network. The algorithm provides a linear improvement over single machine Gradient Descent in memory cost and, when agents hold enough data with respect to the network size and inverse spectral gap, a linear speed-up in computational runtime for any network topology. We present simulations that show how the number of Random Features, iterations and samples impact predictive performance.


AdaSGD: Bridging the gap between SGD and Adam

arXiv.org Machine Learning

In the context of stochastic gradient descent(SGD) and adaptive moment estimation (Adam),researchers have recently proposed optimization techniques that transition from Adam to SGD with the goal of improving both convergence and generalization performance. However, precisely how each approach trades off early progress and generalization is not well understood; thus, it is unclear when or even if, one should transition from one approach to the other. In this work, by first studying the convex setting, we identify potential contributors to observed differences in performance between SGD and Adam. In particular,we provide theoretical insights for when and why Adam outperforms SGD and vice versa. We ad-dress the performance gap by adapting a single global learning rate for SGD, which we refer to as AdaSGD. We justify this proposed approach with empirical analyses in non-convex settings. On several datasets that span three different domains,we demonstrate how AdaSGD combines the benefits of both SGD and Adam, eliminating the need for approaches that transition from Adam to SGD.


Sliced Kernelized Stein Discrepancy

arXiv.org Machine Learning

Kernelized Stein discrepancy (KSD), though being extensively used in goodness-of-fit tests and model learning, suffers from the curse-of-dimensionality. We address this issue by proposing the sliced Stein discrepancy and its scalable and kernelized variants, which employs kernel-based test functions defined on the optimal onedimensional projections instead of the full input in high dimensions. When applied to goodness-of-fit tests, extensive experiments show the proposed discrepancy significantly outperforms KSD and various baselines in high dimensions. For model learning, we show its advantages by training an independent component analysis when compared with existing Stein discrepancy baselines. We further propose a novel particle inference method called sliced Stein variational gradient descent (S-SVGD) which alleviates the mode-collapse issue of SVGD in training variational autoencoders.


Guarantees for Tuning the Step Size using a Learning-to-Learn Approach

arXiv.org Machine Learning

Learning-to-learn (using optimization algorithms to learn a new optimizer) has successfully trained efficient optimizers in practice. This approach relies on meta-gradient descent on a meta-objective based on the trajectory that the optimizer generates. However, there were few theoretical guarantees on how to avoid meta-gradient explosion/vanishing problems, or how to train an optimizer with good generalization performance. In this paper, we study the learning-to-learn approach on a simple problem of tuning the step size for quadratic loss. Our results show that although there is a way to design the meta-objective so that the meta-gradient remain polynomially bounded, computing the meta-gradient directly using backpropagation leads to numerical issues that look similar to gradient explosion/vanishing problems. We also characterize when it is necessary to compute the meta-objective on a separate validation set instead of the original training set. Finally, we verify our results empirically and show that a similar phenomenon appears even for more complicated learned optimizers parametrized by neural networks.


Optimization Landscape of Tucker Decomposition

arXiv.org Machine Learning

Tucker decomposition is a popular technique for many data analysis and machine learning applications. Finding a Tucker decomposition is a nonconvex optimization problem. As the scale of the problems increases, local search algorithms such as stochastic gradient descent have become popular in practice. In this paper, we characterize the optimization landscape of the Tucker decomposition problem. In particular, we show that if the tensor has an exact Tucker decomposition, for a standard nonconvex objective of Tucker decomposition, all local minima are also globally optimal. We also give a local search algorithm that can find an approximate local (and global) optimal solution in polynomial time.


A Multilevel Approach to Training

arXiv.org Machine Learning

We propose a novel training method based on nonlinear multilevel minimization techniques, commonly used for solving discretized large scale partial differential equations. Our multilevel training method constructs a multilevel hierarchy by reducing the number of samples. The training of the original model is then enhanced by internally training surrogate models constructed with fewer samples. We construct the surrogate models using first-order consistency approach. This gives rise to surrogate models, whose gradients are stochastic estimators of the full gradient, but with reduced variance compared to standard stochastic gradient estimators. We illustrate the convergence behavior of the proposed multilevel method to machine learning applications based on logistic regression. A comparison with subsampled Newton's and variance reduction methods demonstrate the efficiency of our multilevel method.


Gradient Based Memory Editing for Task-Free Continual Learning

arXiv.org Machine Learning

Prior work on continual learning often operate in a "task-aware" manner, by assuming that the task boundaries and identifies of the data instances are known at all times. While in practice, it is rarely the case that such information are exposed to the methods (i.e., thus called "task-free")--a setting that is relatively underexplored. Recent attempts on task-free continual learning build on previous memory replay methods and focus on developing memory management strategies such that model performance over priorly seen instances can be best retained. In this paper, looking from a complementary angle, we propose a principled approach to "edit" stored examples which aims to carry more updated information from the data stream in the memory. We use gradient updates to edit stored examples so that they are more likely to be forgotten in future updates. Experiments on five benchmark datasets show the proposed method can be seamlessly combined with baselines to significantly improve the performance. Code has been released at https://github.com/INK-USC/GMED.


On the Generalization Benefit of Noise in Stochastic Gradient Descent

arXiv.org Machine Learning

It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks. However recent papers have questioned this claim, arguing that this effect is simply a consequence of suboptimal hyperparameter tuning or insufficient compute budgets when the batch size is large. In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses. Our results confirm that the noise in stochastic gradients can enhance generalization. We study how the optimal learning rate schedule changes as the epoch budget grows, and we provide a theoretical account of our observations based on the stochastic differential equation perspective of SGD dynamics.


Nearest Neighbour Based Estimates of Gradients: Sharp Nonasymptotic Bounds and Applications

arXiv.org Machine Learning

Motivated by a wide variety of applications, ranging from stochastic optimization to dimension reduction through variable selection, the problem of estimating gradients accurately is of crucial importance in statistics and learning theory. We consider here the classic regression setup, where a real valued square integrable r.v. Y is to be predicted upon observing a (possibly high dimensional) random vector X by means of a predictive function f(X) as accurately as possible in the mean-squared sense and study a nearest-neighbour-based pointwise estimate of the gradient of the optimal predictive function, the regression function m(x) E[Y X x]. Under classic smoothness conditions combined with the assumption that the tails of Y m(X) are sub-Gaussian, we prove nonasymptotic bounds improving upon those obtained for alternative estimation methods. Beyond the novel theoretical results established, several illustrative numerical experiments have been carried out. The latter provide strong empirical evidence that the estimation method proposed works very well for various statistical problems involving gradient estimation, namely dimensionality reduction, stochastic gradient descent optimization and quantifying disentanglement.


Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Langevin algorithms

arXiv.org Machine Learning

A new generation of stochastic gradient decent algorithms, namely stochastic gradient Langevin dynamics (SGLD), can be efficient in finding global minimizers of possibly complicated, highdimensional landscapes under suitable regularity assumptions for the gradient, see Raginsky et al. (2017), Welling and Teh (2011) and references therein. However, in the specific case of tuning ANNs, or simply neural networks henceforth, problems arise already at the theoretical level. As discussed in Section 4 below in some detail, the functionals to be minimized fail any form of dissipativity which should be a sine qua non for any stable gradient algorithms. Adding a quadratic regularization term cannot always remedy this, in which case one needs to replace it with a higher order penalty term. However, the addition of such a term leads to the violation of the global Lipschitz continuity for the regularized gradient, which in turn renders the use of gradient descent methods problematic as it can be seen in Figure 2. This issue has been highlighted in the case of Euler discretizations (of which SGLD is an example) in Hutzenthaler et al. (2011), where it is proven that the difference of the exact solution of the corresponding stochastic differential equation (SDE) and of the numerical approximation at even a finite time point diverges to infinity in the strong mean square sense. A natural way to address the above issue is to combine higher order regularization with taming techniques to improve the stability of any resulting algorithm. In particular, the use of taming techniques in the construction of stable numerical approximations for nonlinear SDEs has gained substantial attention in recent years and was introduced by Hutzenthaler et al. (2012) and, independently, All the authors were supported by The Alan Turing Institute, London under the EPSRC grant EP/N510129/1. A. L. and M. R. thank for the "Lendület" grant LP 2015-6 of the Hungarian Academy of Sciences.