Gradient Descent
Minimizing Nonconvex Population Risk from Rough Empirical Risk
Jin, Chi, Liu, Lydia T., Ge, Rong, Jordan, Michael I.
Population risk---the expectation of the loss over the sampling mechanism---is always of primary interest in machine learning. However, learning algorithms only have access to empirical risk, which is the average loss over training examples. Although the two risks are typically guaranteed to be pointwise close, for applications with nonconvex nonsmooth losses (such as modern deep networks), the effects of sampling can transform a well-behaved population risk into an empirical risk with a landscape that is problematic for optimization. The empirical risk can be nonsmooth, and it may have many additional local minima. This paper considers a general optimization framework which aims to find approximate local minima of a smooth nonconvex function $F$ (population risk) given only access to the function value of another function $f$ (empirical risk), which is pointwise close to $F$ (i.e., $\|F-f\|_{\infty} \le \nu$). We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of $f$ which is guaranteed to find an $\epsilon$-second-order stationary point if $\nu \le O(\epsilon^{1.5}/d)$, thus escaping all saddle points of $F$ and all the additional local minima introduced by $f$. We also provide an almost matching lower bound showing that our SGD-based approach achieves the optimal trade-off between $\nu$ and $\epsilon$, as well as the optimal dependence on problem dimension $d$, among all algorithms making a polynomial number of queries. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit, whose empirical risk is nonsmooth.
Stochastic Variance Reduction for Policy Gradient Estimation
Xu, Tianbing, Liu, Qiang, Peng, Jian
Recent advances in policy gradient methods and deep learning have demonstrated their applicability for complex reinforcement learning problems. However, the variance of the performance gradient estimates obtained from the simulation is often excessive, leading to poor sample efficiency. In this paper, we apply the stochastic variance reduced gradient descent (SVRG) to model-free policy gradient to significantly improve the sample-efficiency. The SVRG estimation is incorporated into a trust-region Newton conjugate gradient framework for the policy optimization. On several Mujoco tasks, our method achieves significantly better performance compared to the state-of-the-art model-free policy gradient methods in robotic continuous control such as trust region policy optimization (TRPO)
Gradient descent in Gaussian random fields as a toy model for high-dimensional optimisation in deep learning
Chouza, Mariano, Roberts, Stephen, Zohren, Stefan
Our aim is to study gradient descent in such loss functions or energy landscapes and compare it to results obtained from real high-dimensional optimization problems such as encountered in deep learning. In particular, we analyze the distribution of the improved loss function after a step of gradient descent, provide analytic expressions for the moments as well as prove asymptotic normality as the dimension of the parameter space becomes large. Moreover, we compare this with the expectation of the global minimum of the landscape obtained by means of the Euler characteristic of excursion sets. Besides complementing our analytical findings with numerical results from simulated Gaussian random fields, we also compare it to loss functions obtained from optimisation problems on synthetic and real data sets by proposing a "black box" random field toy-model for a deep neural network loss function.
Generalized Byzantine-tolerant SGD
Xie, Cong, Koyejo, Oluwasanmi, Gupta, Indranil
We propose three new robust aggregation rules for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The attackers can arbitrarily manipulate the data transferred between the servers and the workers in the parameter server (PS) architecture. We prove the Byzantine resilience properties of these aggregation rules. Empirical analysis shows that the proposed techniques outperform current approaches for realistic use cases and Byzantine attack scenarios.
Multi-View Factorization Machines
Cao, Bokai, Zhou, Hucheng, Li, Guoqiang, Yu, Philip S.
For a learning task, data can usually be collected from different sources or be represented from multiple views. For example, laboratory results from different medical examinations are available for disease diagnosis, and each of them can only reflect the health state of a person from a particular aspect/view. Therefore, different views provide complementary information for learning tasks. An effective integration of the multi-view information is expected to facilitate the learning performance. In this paper, we propose a general predictor, named multi-view machines (MVMs), that can effectively include all the possible interactions between features from multiple views. A joint factorization is embedded for the full-order interaction parameters which allows parameter estimation under sparsity. Moreover, MVMs can work in conjunction with different loss functions for a variety of machine learning tasks. A stochastic gradient descent method is presented to learn the MVM model. We further illustrate the advantages of MVMs through comparison with other methods for multi-view classification, including support vector machines (SVMs), support tensor machines (STMs) and factorization machines (FMs).
Byzantine Stochastic Gradient Descent
Alistarh, Dan, Allen-Zhu, Zeyuan, Li, Jerry
This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sampling complexity and time complexity.
The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory
Alistarh, Dan, De Sa, Christopher, Konstantinov, Nikola
Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks. Given the recent practical focus on distributed machine learning, significant work has been dedicated to the convergence properties of this algorithm under the inconsistent and noisy updates arising from execution in a distributed environment. However, surprisingly, the convergence properties of this classic algorithm in the standard shared-memory model are still not well-understood. In this work, we address this gap, and provide new convergence bounds for lock-free concurrent stochastic gradient descent, executing in the classic asynchronous shared memory model, against a strong adaptive adversary. Our results give improved upper and lower bounds on the "price of asynchrony" when executing the fundamental SGD algorithm in a concurrent setting. They show that this classic optimization tool can converge faster and with a wider range of parameters than previously known under asynchronous iterations. At the same time, we exhibit a fundamental trade-off between the maximum delay in the system and the rate at which SGD can converge, which governs the set of parameters under which this algorithm can still work efficiently.
Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates
Jentzen, Arnulf, von Wurstemberger, Philippe
The stochastic gradient descent (SGD) optimization algorithm plays a central role in machine learning and, in particular, deep learning applications such as image analysis and speech recognition (cf., e.g., [12, 13, 16, 23]). It is therefore important to analyze and quantify the convergence speed of the SGD method. There is a vast amount of scientific literature investigating and providing upper bounds for the SGD method and modifications of it (cf., e.g., [3, 4, 5, 6, 7, 8, 9, 10, 11, 18, 20, 21, 24] and cf., e.g., [14] for a more comprehensive review of the literature). Much less attention has been paid to proving lower error bounds for the SGD method, that is, to quantifying the best possible speed of convergence which the SGD method can achieve (cf., e.g., [2, 17, 19, 22, 25]). It is the key contribution of this paper to make a step in this direction.
Gradient Descent Quantizes ReLU Network Features
Maennel, Hartmut, Bousquet, Olivier, Gelly, Sylvain
Deep neural networks are often trained in the over-parametrized regime (i.e. with far more parameters than training examples), and understanding why the training converges to solutions that generalize remains an open problem. Several studies have highlighted the fact that the training procedure, i.e. mini-batch Stochastic Gradient Descent (SGD) leads to solutions that have specific properties in the loss landscape. However, even with plain Gradient Descent (GD) the solutions found in the over-parametrized regime are pretty good and this phenomenon is poorly understood. We propose an analysis of this behavior for feedforward networks with a ReLU activation function under the assumption of small initialization and learning rate and uncover a quantization effect: The weight vectors tend to concentrate at a small number of directions determined by the input data. As a consequence, we show that for given input data there are only finitely many, "simple" functions that can be obtained, independent of the network size. This puts these functions in analogy to linear interpolations (for given input data there are finitely many triangulations, which each determine a function by linear interpolation). We ask whether this analogy extends to the generalization properties - while the usual distribution-independent generalization property does not hold, it could be that for e.g. smooth functions with bounded second derivative an approximation property holds which could "explain" generalization of networks (of unbounded size) to unseen inputs.
Fast Convergence for Stochastic and Distributed Gradient Descent in the Interpolation Limit
Modern supervised learning techniques, particularly those using so called deep nets, involve fitting high dimensional labelled data sets with functions containing very large numbers of parameters. Much of this work is empirical, and interesting phenomena have been observed that require theoretical explanations, however the non-convexity of the loss functions complicates the analysis. Recently it has been proposed that some of the success of these techniques resides in the effectiveness of the simple stochastic gradient descent algorithm in the so called interpolation limit in which all labels are fit perfectly. This analysis is made possible since the SGD algorithm reduces to a stochastic linear system near the interpolating minimum of the loss function. Here we exploit this insight by analyzing a distributed algorithm for gradient descent, also in the interpolating limit. The algorithm corresponds to gradient descent applied to a simple penalized distributed loss function, $L({\bf w}_1,...,{\bf w}_n) = \Sigma_i l_i({\bf w}_i) + \mu \sum_{}|{\bf w}_i-{\bf w}_j|^2$. Here each node is allowed its own parameter vector and $$ denotes edges of a connected graph defining the communication links between nodes. It is shown that this distributed algorithm converges linearly (ie the error reduces exponentially with iteration number), with a rate $1-\frac{\eta}{n}\lambda_{min}(H)