Goto

Collaborating Authors

 Gradient Descent


Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval

arXiv.org Machine Learning

In recent literature, a general two step procedure has been formulated for solving the problem of phase retrieval. First, a spectral technique is used to obtain a constant-error initial estimate, following which, the estimate is refined to arbitrary precision by first-order optimization of a non-convex loss function. Numerical experiments, however, seem to suggest that simply running the iterative schemes from a random initialization may also lead to convergence, albeit at the cost of slightly higher sample complexity. In this paper, we prove that, in fact, constant step size online stochastic gradient descent (SGD) converges from arbitrary initializations for the non-smooth, non-convex amplitude squared loss objective. In this setting, online SGD is also equivalent to the randomized Kaczmarz algorithm from numerical analysis. Our analysis can easily be generalized to other single index models. It also makes use of new ideas from stochastic process theory, including the notion of a summary state space, which we believe will be of use for the broader field of non-convex optimization.


On the Global Convergence of (Fast) Incremental Expectation Maximization Methods

arXiv.org Machine Learning

The EM algorithm is one of the most popular algorithm for inference in latent data models. The original formulation of the EM algorithm does not scale to large data set, because the whole data set is required at each iteration of the algorithm. To alleviate this problem, Neal and Hinton have proposed an incremental version of the EM (iEM) in which at each iteration the conditional expectation of the latent data (E-step) is updated only for a mini-batch of observations. Another approach has been proposed by Capp\'e and Moulines in which the E-step is replaced by a stochastic approximation step, closely related to stochastic gradient. In this paper, we analyze incremental and stochastic version of the EM algorithm as well as the variance reduced-version of Chen et. al. in a common unifying framework. We also introduce a new version incremental version, inspired by the SAGA algorithm by Defazio et. al. We establish non-asymptotic convergence bounds for global convergence. Numerical applications are presented in this article to illustrate our findings.


Splitting Steepest Descent for Growing Neural Architectures

arXiv.org Machine Learning

We develop a progressive training approach for neural networks which adaptively grows the network structure by splitting existing neurons to multiple off-springs. By leveraging a functional steepest descent idea, we derive a simple criterion for deciding the best subset of neurons to split and a splitting gradient for optimally updating the off-springs. Theoretically, our splitting strategy is a second-order functional steepest descent for escaping saddle points in an $\infty$-Wasserstein metric space, on which the standard parametric gradient descent is a first-order steepest descent. Our method provides a new computationally efficient approach for optimizing neural network structures, especially for learning lightweight neural architectures in resource-constrained settings.


Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks

arXiv.org Machine Learning

Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under l2 loss by gradient descent with infinitesimally small learning rate (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (Jacot et al., 2018). An efficient algorithm to compute the NTK, as well as its convolutional counterparts, appears in Arora et al. (2019a), which allowed studying performance of infinitely wide nets on datasets like CIFAR-10. However, super-quadratic running time of kernel methods makes them best suited for small-data tasks. We report results suggesting neural tangent kernels perform strongly on low-data tasks. 1. On a standard testbed of classification/regression tasks from the UCI database, NTK SVM beats the previous gold standard, Random Forests (RF), and also the corresponding finite nets. 2. On CIFAR-10 with 10 - 640 training samples, Convolutional NTK consistently beats ResNet-34 by 1% - 3%. 3. On VOC07 testbed for few-shot image classification tasks on ImageNet with transfer learning (Goyal et al., 2019), replacing the linear SVM currently used with a Convolutional NTK SVM consistently improves performance. 4. Comparing the performance of NTK with the finite-width net it was derived from, NTK behavior starts at lower net widths than suggested by theoretical analysis(Arora et al., 2019a). NTK's efficacy may trace to lower variance of output.


PopSGD: Decentralized Stochastic Gradient Descent in the Population Model

arXiv.org Machine Learning

The population model is a standard way to represent large-scale decentralized distributed systems, in which agents with limited computational power interact in randomly chosen pairs, in order to collectively solve global computational tasks. In contrast with synchronous gossip models, nodes are anonymous, lack a common notion of time, and have no control over their scheduling. In this paper, we examine whether large-scale distributed optimization can be performed in this extremely restrictive setting. We introduce and analyze a natural decentralized variant of stochastic gradient descent (SGD), called PopSGD, in which every node maintains a local parameter, and is able to compute stochastic gradients with respect to this parameter. Every pair-wise node interaction performs a stochastic gradient step at each agent, followed by averaging of the two models. We prove that, under standard assumptions, SGD can converge even in this extremely loose, decentralized setting, for both convex and non-convex objectives. Moreover, surprisingly, in the former case, the algorithm can achieve linear speedup in the number of nodes $n$. Our analysis leverages a new technical connection between decentralized SGD and randomized load-balancing, which enables us to tightly bound the concentration of node parameters. We validate our analysis through experiments, showing that PopSGD can achieve convergence and speedup for large-scale distributed learning tasks in a supercomputing environment.


A geometric interpretation of stochastic gradient descent using diffusion metrics

arXiv.org Machine Learning

Stochastic gradient descent (SGD) is a key ingredient in the training of deep neural networks and yet its geometrical significance appears elusive. We study a deterministic model in which the trajectories of our dynamical systems are described via geodesics of a family of metrics arising from the diffusion matrix. These metrics encode information about the highly non-isotropic gradient noise in SGD. We establish a parallel with General Relativity models, where the role of the electromagnetic field is played by the gradient of the loss function. We compute an example of a two layer network.


Communication Efficient Decentralized Training with Multiple Local Updates

arXiv.org Machine Learning

Decentralized optimization has been demonstrated to be very useful in machine learning. This work studies the communication-efficiency issue in decentralized optimization. We analyze the Periodic Decentralized Stochastic Gradient Descent (PD-SGD) algorithm, a straightforward combination of federated averaging and decentralized SGD. For the setting of for non-convex objective and non-identically distributed data, we prove that PD-SGD converges to a critical point. In particular, the number of local SGDs trades off communication and local computation. From an algorithmic perspective, we analyze a novel version of PD-SGD, which alternates between multiple local updates and multiple decentralized SGDs. We also show that when we periodically shrink the length of local updates, this generalized PD-SGD can better balance the communication-convergence trade-off both theoretically and empirically.


Path Length Bounds for Gradient Descent

#artificialintelligence

Figure 1: A two-dimensional convex function represented via contour lines. The function value is constant on the boundary of each such ellipse, and decreases as the ellipse becomes smaller and smaller. Let us assume we want to minimize this function starting from a point \(A\). The red line shows the path followed by a gradient descent optimizer converging to the minimum point \(B\), while the green dashed line represents the direct line joining \(A\) and \(B\). In today's post, we will discuss an interesting property concerning the trajectory of gradient descent iterates, namely the length of the Gradient Descent curve.



Improved Zeroth-Order Variance Reduced Algorithms and Analysis for Nonconvex Optimization

arXiv.org Machine Learning

Two types of zeroth-order stochastic algorithms have recently been designed for nonconvex optimization respectively based on the first-order techniques SVRG and SARAH/SPIDER. This paper addresses several important issues that are still open in these methods. First, all existing SVRG-type zeroth-order algorithms suffer from worse function query complexities than either zeroth-order gradient descent (ZO-GD) or stochastic gradient descent (ZO-SGD). In this paper, we propose a new algorithm ZO-SVRG-Coord-Rand and develop a new analysis for an existing ZO-SVRG-Coord algorithm proposed in Liu et al. 2018b, and show that both ZO-SVRG-Coord-Rand and ZO-SVRG-Coord (under our new analysis) outperform other exiting SVRG-type zeroth-order methods as well as ZO-GD and ZO-SGD. Second, the existing SPIDER-type algorithm SPIDER-SZO (Fang et al. 2018) has superior theoretical performance, but suffers from the generation of a large number of Gaussian random variables as well as a $\sqrt{\epsilon}$-level stepsize in practice. In this paper, we develop a new algorithm ZO-SPIDER-Coord, which is free from Gaussian variable generation and allows a large constant stepsize while maintaining the same convergence rate and query complexity, and we further show that ZO-SPIDER-Coord automatically achieves a linear convergence rate as the iterate enters into a local PL region without restart and algorithmic modification.