Goto

Collaborating Authors

 Gradient Descent


Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates

arXiv.org Machine Learning

The stochastic gradient descent (SGD) optimization algorithm plays a central role in machine learning and, in particular, deep learning applications such as image analysis and speech recognition (cf., e.g., [12, 13, 16, 23]). It is therefore important to analyze and quantify the convergence speed of the SGD method. There is a vast amount of scientific literature investigating and providing upper bounds for the SGD method and modifications of it (cf., e.g., [3, 4, 5, 6, 7, 8, 9, 10, 11, 18, 20, 21, 24] and cf., e.g., [14] for a more comprehensive review of the literature). Much less attention has been paid to proving lower error bounds for the SGD method, that is, to quantifying the best possible speed of convergence which the SGD method can achieve (cf., e.g., [2, 17, 19, 22, 25]). It is the key contribution of this paper to make a step in this direction.


Gradient Descent Quantizes ReLU Network Features

arXiv.org Machine Learning

Deep neural networks are often trained in the over-parametrized regime (i.e. with far more parameters than training examples), and understanding why the training converges to solutions that generalize remains an open problem. Several studies have highlighted the fact that the training procedure, i.e. mini-batch Stochastic Gradient Descent (SGD) leads to solutions that have specific properties in the loss landscape. However, even with plain Gradient Descent (GD) the solutions found in the over-parametrized regime are pretty good and this phenomenon is poorly understood. We propose an analysis of this behavior for feedforward networks with a ReLU activation function under the assumption of small initialization and learning rate and uncover a quantization effect: The weight vectors tend to concentrate at a small number of directions determined by the input data. As a consequence, we show that for given input data there are only finitely many, "simple" functions that can be obtained, independent of the network size. This puts these functions in analogy to linear interpolations (for given input data there are finitely many triangulations, which each determine a function by linear interpolation). We ask whether this analogy extends to the generalization properties - while the usual distribution-independent generalization property does not hold, it could be that for e.g. smooth functions with bounded second derivative an approximation property holds which could "explain" generalization of networks (of unbounded size) to unseen inputs.


Fast Convergence for Stochastic and Distributed Gradient Descent in the Interpolation Limit

arXiv.org Machine Learning

Modern supervised learning techniques, particularly those using so called deep nets, involve fitting high dimensional labelled data sets with functions containing very large numbers of parameters. Much of this work is empirical, and interesting phenomena have been observed that require theoretical explanations, however the non-convexity of the loss functions complicates the analysis. Recently it has been proposed that some of the success of these techniques resides in the effectiveness of the simple stochastic gradient descent algorithm in the so called interpolation limit in which all labels are fit perfectly. This analysis is made possible since the SGD algorithm reduces to a stochastic linear system near the interpolating minimum of the loss function. Here we exploit this insight by analyzing a distributed algorithm for gradient descent, also in the interpolating limit. The algorithm corresponds to gradient descent applied to a simple penalized distributed loss function, $L({\bf w}_1,...,{\bf w}_n) = \Sigma_i l_i({\bf w}_i) + \mu \sum_{}|{\bf w}_i-{\bf w}_j|^2$. Here each node is allowed its own parameter vector and $$ denotes edges of a connected graph defining the communication links between nodes. It is shown that this distributed algorithm converges linearly (ie the error reduces exponentially with iteration number), with a rate $1-\frac{\eta}{n}\lambda_{min}(H)


The Implicit Bias of Gradient Descent on Separable Data

arXiv.org Machine Learning

We show that gradient descent on an unregularized logistic regression problem, for linearly separable datasets, converges to the direction of the max-margin (hard margin SVM) solution. The result generalizes also to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a deep network in a certain restricted setting. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization in more complex models and with other optimization methods.


Residual Networks: Lyapunov Stability and Convex Decomposition

arXiv.org Machine Learning

While training error of most deep neural networks degrades as the depth of the network increases, residual networks appear to be an exception. We show that the main reason for this is the Lyapunov stability of the gradient descent algorithm: for an arbitrarily chosen step size, the equilibria of the gradient descent are most likely to remain stable for the parametrization of residual networks. We then present an architecture with a pair of residual networks to approximate a large class of functions by decomposing them into a convex and a concave part. Some parameters of this model are shown to change little during training, and this imperfect optimization prevents overfitting the data and leads to solutions with small Lipschitz constants, while providing clues about the generalization of other deep networks.


SUCAG: Stochastic Unbiased Curvature-aided Gradient Method for Distributed Optimization

arXiv.org Machine Learning

We propose and analyze a new stochastic gradient method, which we call Stochastic Unbiased Curvature-aided Gra- dient (SUCAG), for finite sum optimization problems. SUCAG constitutes an unbiased total gradient tracking technique that uses Hessian information to accelerate convergence. We an- alyze our method under the general asynchronous model of computation, in which functions are selected infinitely often, but with delays that can grow sublinearly. For strongly convex problems, we establish linear convergence for the SUCAG method. When the initialization point is sufficiently close to the optimal solution, the established convergence rate is only dependent on the condition number of the problem, making it strictly faster than the known rate for the SAGA method. Furthermore, we describe a Markov-driven approach of implementing the SUCAG method in a distributed asynchronous multi-agent setting, via gossiping along a random walk on the communication graph. We show that our analysis applies as long as the undirected graph is connected and, notably, establishes an asymptotic linear convergence rate that is robust to the graph topology. Numerical results demonstrate the merit of our algorithm over existing methods.


Gradient Descent with Random Initialization: Fast Global Convergence for Nonconvex Phase Retrieval

arXiv.org Machine Learning

This paper considers the problem of solving systems of quadratic equations, namely, recovering an object of interest $\mathbf{x}^{\natural}\in\mathbb{R}^{n}$ from $m$ quadratic equations/samples $y_{i}=(\mathbf{a}_{i}^{\top}\mathbf{x}^{\natural})^{2}$, $1\leq i\leq m$. This problem, also dubbed as phase retrieval, spans multiple domains including physical sciences and machine learning. We investigate the efficiency of gradient descent (or Wirtinger flow) designed for the nonconvex least squares problem. We prove that under Gaussian designs, gradient descent --- when randomly initialized --- yields an $\epsilon$-accurate solution in $O\big(\log n+\log(1/\epsilon)\big)$ iterations given nearly minimal samples, thus achieving near-optimal computational and sample complexities at once. This provides the first global convergence guarantee concerning vanilla gradient descent for phase retrieval, without the need of (i) carefully-designed initialization, (ii) sample splitting, or (iii) sophisticated saddle-point escaping schemes. All of these are achieved by exploiting the statistical models in analyzing optimization algorithms, via a leave-one-out approach that enables the decoupling of certain statistical dependency between the gradient descent iterates and the data.


D$^2$: Decentralized Training over Decentralized Data

arXiv.org Machine Learning

While training a machine learning model using multiple workers, each of which collects data from their own data sources, it would be most useful when the data collected from different workers can be {\em unique} and {\em different}. Ironically, recent analysis of decentralized parallel stochastic gradient descent (D-PSGD) relies on the assumption that the data hosted on different workers are {\em not too different}. In this paper, we ask the question: {\em Can we design a decentralized parallel stochastic gradient descent algorithm that is less sensitive to the data variance across workers?} In this paper, we present D$^2$, a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance \xr{among workers} (imprecisely, "decentralized" data). The core of D$^2$ is a variance blackuction extension of the standard D-PSGD algorithm, which improves the convergence rate from $O\left({\sigma \over \sqrt{nT}} + {(n\zeta^2)^{\frac{1}{3}} \over T^{2/3}}\right)$ to $O\left({\sigma \over \sqrt{nT}}\right)$ where $\zeta^{2}$ denotes the variance among data on different workers. As a result, D$^2$ is robust to data variance among workers. We empirically evaluated D$^2$ on image classification tasks where each worker has access to only the data of a limited set of labels, and find that D$^2$ significantly outperforms D-PSGD.


Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations

arXiv.org Machine Learning

Over-parameterized models are crucial in deep learning, but their workings are far from understood. Over-parameterization -- the technique of using more parameters than statistically necessary -- apparently improves the training: theoretical and empirical results have suggested that it can enhance the geometric properties of the optimization landscape in simplified settings [24, 18, 17, 34] and thus make it easier to train over-parameterized models. On the other hand, over-parameterization often doesn't hurt the test performance, even if the number of parameters is much larger than the number of examples. Large neural networks used in practice have enough expressiveness to fit any labels of the training datasets [42, 17]. The training objective function may have multiple global minima with almost zero training error, some of which generalize better than the others [21, 11]. However, local improvement algorithms such as stochastic gradient descent, starting with proper initialization, may prefer some generalizable local minima to the others and thus provide an implicit effect of regularization [38, 28, 19, 27, 41]. Such regularization seems to depend on the algorithmic choice, the initialization scheme, and certain intrinsic properties of the data. The phenomenon and intuition above can be theoretically fleshed out in the context of linear models [35], whereas less is known for nonlinear models whose training objectives are usually nonconvex. The very important work of Gunasekar et al. [16] initiates the study of low-rank matrix factorization models with over-parameterization and conjectures that gradient descent prefers small trace norm solution in over-parameterized models with thorough empirical evidences.


On the role of synaptic stochasticity in training low-precision neural networks

arXiv.org Machine Learning

International Centre for Theoretical Physics, Trieste, Italy Stochasticity and limited precision of synaptic weights in neural network models are key aspects of both biological and hardware modeling of learning processes. Here we show that a neural network model with stochastic binary weights naturally gives prominence to exponentially rare dense regions of solutions with a number of desirable properties such as robustness and good generalization performance, while typical solutions are isolated and hard to find. Binary solutions of the standard perceptron problem are obtained from a simple gradient descent procedure on a set of real values parametrizing a probability distribution over the binary synapses. Both analytical and numerical results are presented. An algorithmic extension aimed at training discrete deep neural networks is also investigated. Learning can be regarded as an optimization process over the connection weights of a neural network. In nature, synaptic weights are known to be plastic, low precision and unreliable, and it is an interesting issue to understand if this stochasticity can help learning or if it is an obstacle.