Gradient Descent
Efficiently testing local optimality and escaping saddles for ReLU networks
Yun, Chulhee, Sra, Suvrit, Jadbabaie, Ali
We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of M data points on the nondifferentiability of the ReLU divides the parameter space into at most 2^M regions, which makes analysis difficult. By exploiting polyhedral geometry, we reduce the total computation down to one convex quadratic program (QP) for each hidden node, O(M) (in)equality tests, and one (or a few) nonconvex QP. For the last QP, we show that our specific problem can be solved efficiently, in spite of nonconvexity. In the benign case, we solve one equality constrained QP, and we prove that projected gradient descent solves it exponentially fast. In the bad case, we have to solve a few more inequality constrained QPs, but we prove that the time complexity is exponential only in the number of inequality constraints. Our experiments show that either benign case or bad case with very few inequality constraints occurs, implying that our algorithm is efficient in most cases.
Introducing Noise in Decentralized Training of Neural Networks
Adilova, Linara, Paul, Nathalie, Schlicht, Peter
It has been shown that injecting noise into the neural network weights during the training process leads to a better generalization of the resulting model. Noise injection in the distributed setup is a straightforward technique and it represents a promising approach to improve the locally trained models. We investigate the effects of noise injection into the neural networks during a decentralized training process. We show both theoretically and empirically that noise injection has no positive effect in expectation on linear models, though. However for non-linear neural networks we empirically show that noise injection substantially improves model quality helping to reach a generalization ability of a local model close to the serial baseline.
The Convergence of Sparsified Gradient Methods
Alistarh, Dan, Hoefler, Torsten, Johansson, Mikael, Khirirat, Sarit, Konstantinov, Nikola, Renggli, Cรฉdric
Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, gradient sparsification methods - where each node sorts gradients by magnitude, and only communicates a subset of the components, accumulating the rest locally - are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to three orders of magnitude, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification. This is the question we address in this paper. We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD. The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude. Our analysis and empirical validation also reveal that these methods do require analytical conditions to converge well, justifying existing heuristics.
Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks
In this note, we study the dynamics of gradient descent on objective functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar parameters $w_1,\ldots,w_k$), which arise in the context of training depth-$k$ linear neural networks. We prove that for standard random initializations, and under mild assumptions on $f$, the number of iterations required for convergence scales exponentially with the depth $k$. This highlights a potential obstacle in understanding the convergence of gradient-based methods for deep linear neural networks, where $k$ is large.
Human-Level Intelligence or Animal-Like Abilities?
Yet the combination of these factors created a milestone in AI history, as it had a profound impact on real-world applications and the successful deployment of various AI techniques that have been in the works for a very long time, particularly neural networks.g I shared these remarks in various contexts during the course of preparing this article. The audiences ranged from AI and computer science to law and public-policy researchers with an interest in AI. What I found striking is the great interest in this discussion and the comfort, if not general agreement, with the remarks I made. I did get a few "I beg to differ" responses though, all centering on recent advancements relating to optimizing functions, which are key to the successful training of neural networks (such as results on stochastic gradient descent, dropouts, and new activation functions). The objections stemmed from not having named them as breakthroughs (in AI). My answer: They all fall under the enabler I outlined earlier: "increasingly sophisticated statistical and optimization techniques for fitting functions." Follow up question: Does it matter that they are statistical and optimization techniques, as opposed to classical AI techniques?
Learning Preconditioners on Lie Groups
We study two types of preconditioners and preconditioned stochastic gradient descent (SGD) methods in a unified framework. We call the first one the Newton type due to its close relationship to Newton method, and the second one the Fisher type as its preconditioner is closely related to the inverse of Fisher information matrix. Both preconditioners can be derived from one framework, and efficiently learned on any matrix Lie groups designated by the user using natural or relative gradient descent. Many existing preconditioners and methods are special cases of either the Newton type or the Fisher type ones. Experimental results on relatively large scale machine learning problems are reported for performance study.
Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview
Chi, Yuejie, Lu, Yue M., Chen, Yuxin
Substantial progress has been made recently on developing provably accurate and efficient algorithms for low-rank matrix factorization via nonconvex optimization. While conventional wisdom often takes a dim view of nonconvex optimization algorithms due to their susceptibility to spurious local minima, simple iterative methods such as gradient descent have been remarkably successful in practice. The theoretical footings, however, had been largely lacking until recently. In this tutorial-style overview, we highlight the important role of statistical models in enabling efficient nonconvex optimization with performance guarantees. We review two contrasting approaches: (1) two-stage algorithms, which consist of a tailored initialization step followed by successive refinement; and (2) global landscape analysis and initialization-free algorithms. Several canonical matrix factorization problems are discussed, including but not limited to matrix sensing, phase retrieval, matrix completion, blind deconvolution, robust principal component analysis, phase synchronization, and joint alignment. Special care is taken to illustrate the key technical insights underlying their analyses. This article serves as a testament that the integrated thinking of optimization and statistics leads to fruitful research findings.
Sparsified SGD with Memory
Stich, Sebastian U., Cordonnier, Jean-Baptiste, Jaggi, Martin
Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far. In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the better scalability for distributed applications.
Nonconvex Demixing From Bilinear Measurements
We consider the problem of demixing a sequence of source signals from the sum of noisy bilinear measurements. It is a generalized mathematical model for blind demixing with blind deconvolution, which is prevalent across the areas of dictionary learning, image processing, and communications. However, state-of- the-art convex methods for blind demixing via semidefinite programming are computationally infeasible for large-scale problems. Although the existing nonconvex algorithms are able to address the scaling issue, they normally require proper regularization to establish optimality guarantees. The additional regularization yields tedious algorithmic parameters and pessimistic convergence rates with conservative step sizes. To address the limitations of existing methods, we thus develop a provable nonconvex demixing procedure viaWirtinger flow, much like vanilla gradient descent, to harness the benefits of regularization-free fast convergence rate with aggressive step size and computational optimality guarantees. This is achieved by exploiting the benign geometry of the blind demixing problem, thereby revealing that Wirtinger flow enforces the regularization-free iterates in the region of strong convexity and qualified level of smoothness, where the step size can be chosen aggressively.
Zeroth-order (Non)-Convex Stochastic Optimization via Conditional Gradient and Gradient Updates
Balasubramanian, Krishnakumar, Ghadimi, Saeed
In this paper, we propose and analyze zeroth-order stochastic approximation algorithms for nonconvex and convex optimization. Specifically, we propose generalizations of the conditional gradient algorithm achieving rates similar to the standard stochastic gradient algorithm using only zeroth-order information. Furthermore, under a structural sparsity assumption, we first illustrate an implicit regularization phenomenon where the standard stochastic gradient algorithm with zeroth-order information adapts to the sparsity of the problem at hand by just varying the step-size. Next, we propose a truncated stochastic gradient algorithm with zeroth-order information, whose rate depends only poly-logarithmically on the dimensionality.