Gradient Descent
Convergence Problems with Generative Adversarial Networks (GANs)
Generative adversarial networks (GANs) are a novel approach to generative modelling, a task whose goal it is to learn a distribution of real data points. They have often proved difficult to train: GANs are unlike many techniques in machine learning, in that they are best described as a two-player game between a discriminator and generator. This has yielded both unreliability in the training process, and a general lack of understanding as to how GANs converge, and if so, to what. The purpose of this dissertation is to provide an account of the theory of GANs suitable for the mathematician, highlighting both positive and negative results. This involves identifying the problems when training GANs, and how topological and game-theoretic perspectives of GANs have contributed to our understanding and improved our techniques in recent years.
Direct Acceleration of SAGA using Sampled Negative Momentum
We focus on achieving very high accuracy for Problem (1), although for practical optimization tasks, such as supervised learning, low empirical risk may result in high generalization error. In this paper, we treat Problem (1) as a pure optimization problem. When F(ยท) in Problem (1) is strongly convex, traditional analysis shows that gradient descent (GD) yields a fast linear convergence rate but with a high per-iteration cost, and thus may not be suitable for problems with a very large n. As an alternative for large problems, SGD [Robbins and Monro, 1951] uses only one or a mini-batch of gradients in each iteration, and thus enjoys significantly lower per-iteration complexity than GD. However, due to the variance of gradient estimator, vanilla SGD is shown to yield only a sub-linear convergence rate. Recently, stochastic variance reduced methods (e.g., SAG [Roux et al., 2012], SVRG [Johnson and Zhang, 2013], SAGA [Defazio et al., 2014], and their proximal variants, such as [Schmidt et al., 2017], [Xiao and Zhang, 2014] and [Koneฤnรฝ et al., 2016]) were proposed to solve Problem (1). All these methods are equipped with various variance reduction techniques, which help them achieve low per-iteration complexities comparable with SGD and at the same time maintain the fast linear convergence rate of GD.
Empirical Risk Minimization and Stochastic Gradient Descent for Relational Data
Veitch, Victor, Austern, Morgane, Zhou, Wenda, Blei, David M., Orbanz, Peter
Empirical risk minimization is the principal tool for prediction problems, but its extension to relational data remains unsolved. We solve this problem using recent advances in graph sampling theory. We (i) define an empirical risk for relational data and (ii) obtain stochastic gradients for this risk that are automatically unbiased. The key ingredient is to consider the method by which data is sampled from a graph as an explicit component of model design. Theoretical results establish that the choice of sampling scheme is critical. By integrating fast implementations of graph sampling schemes with standard automatic differentiation tools, we are able to solve the risk minimization in a plug-and-play fashion even on large datasets. We demonstrate empirically that relational ERM models achieve state-of-the-art results on semi-supervised node classification tasks. The experiments also confirm the importance of the choice of sampling scheme.
Modular meta-learning
Alet, Ferran, Lozano-Pรฉrez, Tomรกs, Kaelbling, Leslie P.
In many situations, such as robot-learning, training experience is very expensive. One strategy for reducing the amount of training data needed for a new task is to learn some form of prior or bias using data from several related tasks. The objective of this process is to extract information that will substantially reduce the training-data requirements for a new task. This problem is a form of transfer learning, sometimes also called meta-learning or "learning to learn" [1, 2]. Previous approaches to meta-learning for robotics have focused on finding distributions over [3] or initial values of [4, 5] parameters, based on a set of "training tasks," that will enable a new "test task" to be learned with many fewer training examples. Our objective is similar, but rather than focusing on transferring information about parameter values, we focus on finding a reusable set of modules that can form components of a solution to a new task, possibly with a small amount of tuning. Modular approaches to learning have been very successful in structured tasks such as naturallanguage sentence interpretation [6], in which the input signal gives relatively direct information about a good structural decomposition of the problem. We wish to address problems that may benefit from a modular decomposition but do not provide any task-level input from which the structure of a solution may be derived. Nonetheless, we adopt a similar modular structure and parameteradaptation method for learning our reusable modules, but use a general-purpose simulated-annealing search strategy to find an appropriate structural decomposition for each new task.
Random Shuffling Beats SGD after Finite Epochs
HaoChen, Jeffery Z., Sra, Suvrit
A long-standing problem in the theory of stochastic gradient descent (SGD) is to prove that its without-replacement version RandomShuffle converges faster than the usual with-replacement version. We present the first (to our knowledge) non-asymptotic solution to this problem, which shows that after a "reasonable" number of epochs RandomShuffle indeed converges faster than SGD. Specifically, we prove that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective, and T is the total number of iterations. This result shows that after a reasonable number of epochs RandomShuffle is strictly better than SGD (which converges as O(1/T)). The key step toward showing this better dependence on T is the introduction of n into the bound; and as our analysis will show, in general a dependence on n is unavoidable without further changes to the algorithm. We show that for sparse data RandomShuffle has the rate O(1/T^2), again strictly better than SGD. Furthermore, we discuss extensions to nonconvex gradient dominated functions, as well as non-strongly convex settings.
A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates
Arjevani, Yossi, Shamir, Ohad, Srebro, Nathan
We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $\tau$ rounds ago. First, we show that without stochastic noise, delays strongly affect the attainable optimization error: In fact, the error can be as bad as non-delayed gradient descent ran on only $1/\tau$ of the gradients. In sharp contrast, we quantify how stochastic noise makes the effect of delays negligible, improving on previous work which only showed this phenomenon asymptotically or for much smaller delays. Also, in the context of distributed optimization, the results indicate that the performance of gradient descent with delays is competitive with synchronous approaches such as mini-batching. Our results are based on a novel technique for analyzing convergence of optimization algorithms using generating functions.
Multi-Merge Budget Maintenance for Stochastic Gradient Descent SVM Training
Qaadan, Sahar, Glasmachers, Tobias
Budgeted Stochastic Gradient Descent (BSGD) is a state-of-the-art technique for training large-scale kernelized support vector machines. The budget constraint is maintained incrementally by merging two points whenever the pre-defined budget is exceeded. The process of finding suitable merge partners is costly; it can account for up to 45% of the total training time. In this paper we investigate computationally more efficient schemes that merge more than two points at once. We obtain significant speed-ups without sacrificing accuracy.
Speeding Up Budgeted Stochastic Gradient Descent SVM Training with Precomputed Golden Section Search
Glasmachers, Tobias, Qaadan, Sahar
Limiting the model size of a kernel support vector machine to a pre-defined budget is a well-established technique that allows to scale SVM learning and prediction to large-scale data. Its core addition to simple stochastic gradient training is budget maintenance through merging of support vectors. This requires solving an inner optimization problem with an iterative method many times per gradient step. In this paper we replace the iterative procedure with a fast lookup. We manage to reduce the merging time by up to 65% and the total training time by 44% without any loss of accuracy.
Stochastic natural gradient descent draws posterior samples in function space
Smith, Samuel L., Duckworth, Daniel, Le, Quoc V., Sohl-Dickstein, Jascha
Natural gradient descent (NGD) minimises the cost function on a Riemannian manifold whose metric is defined by the Fisher information. In this work, we prove that if the model predictions on the training set approach the true conditional distribution of labels given inputs, then the noise inherent in minibatch gradients causes the stationary distribution of NGD to approach a Bayesian posterior, whose temperature $T \approx \epsilon N/(2B)$ is controlled by the learning rate $\epsilon$, training set size $N$ and batch size $B$. The parameter-dependence of the Fisher metric introduces an implicit prior over the parameters, which we identify as the well-known Jeffreys prior. To support our claims, we show that the distribution of samples from NGD is close to the Laplace approximation to the posterior when $T = 1$. Furthermore, the test loss of ensembles drawn using NGD falls rapidly as we increase the batch size until $B \approx \epsilon N/2$, while above this point the test loss is constant or rises slowly.
Boulevard: Regularized Stochastic Gradient Boosted Trees and Their Limiting Distribution
This paper presents a theoretical study of gradient boosted trees (GBT: Friedman, 2001). Machine learning methods for prediction have generally been thought of as trading off both intelligibility and statistical uncertainty quantification in favor of accuracy. Recent results have started to provide a statistical understanding of methods based on ensembles of decision trees (Breiman et al., 1984). In particular, the consistency of methods related to Random Forests (RFs: Breiman, 2001) has been demonstrated in Biau (2012); Scornet et al. (2015) while Wager et al. (2014); Mentch and Hooker (2016); Wager and Athey (2017) and Athey et al. (2016) prove central limit theorems for RF predictions. These have then been used for tests of variable importance and nonparametric interactions in Mentch and Hooker (2017). In this paper, we extend this analysis to GBT. Analyses of RFs have relied on a subsampling structure to express the estimator in the form of a U-statistic from which central limit theorems can be derived. By contrast, GBT produces trees sequentially with the current tree depending on the values in those built previously, requiring a different analytical approach. While the algorithm proposed in Friedman (2001) is intended to be generally applicable to any loss function, in this paper we focus specifically on nonparametric regression (Stone, 1977, 1982).