Goto

Collaborating Authors

 Gradient Descent


Analysis of gradient descent methods with non-diminishing, bounded errors

arXiv.org Machine Learning

The main aim of this paper is to provide an analysis of gradient descent (GD) algorithms with gradient errors that do not necessarily vanish, asymptotically. In particular, sufficient conditions are presented for both stability (almost sure boundedness of the iterates) and convergence of GD with bounded, (possibly) non-diminishing gradient errors. In addition to ensuring stability, such an algorithm is shown to converge to a small neighborhood of the minimum set, which depends on the gradient errors. It is worth noting that the main result of this paper can be used to show that GD with asymptotically vanishing errors indeed converges to the minimum set. The results presented herein are not only more general when compared to previous results, but our analysis of GD with errors is new to the literature to the best of our knowledge. Our work extends the contributions of Mangasarian & Solodov, Bertsekas & Tsitsiklis and Tadic & Doucet. Using our framework, a simple yet effective implementation of GD using simultaneous perturbation stochastic approximations (SP SA), with constant sensitivity parameters, is presented. Another important improvement over many previous results is that there are no `additional' restrictions imposed on the step-sizes. In machine learning applications where step-sizes are related to learning rates, our assumptions, unlike those of other papers, do not affect these learning rates. Finally, we present experimental results to validate our theory.


Riemannian stochastic quasi-Newton algorithm with variance reduction and its convergence analysis

arXiv.org Machine Learning

Stochastic variance reduction algorithms have recently become popular for minimizing the average of a large, but finite number of loss functions. The present paper proposes a Riemannian stochastic quasi-Newton algorithm with variance reduction (R-SQN-VR). The key challenges of averaging, adding, and subtracting multiple gradients are addressed with notions of retraction and vector transport. We present convergence analyses of R-SQN-VR on both non-convex and retraction-convex functions under retraction and vector transport operators. The proposed algorithm is evaluated on the Karcher mean computation on the symmetric positive-definite manifold and the low-rank matrix completion on the Grassmann manifold. In all cases, the proposed algorithm outperforms the state-of-the-art Riemannian batch and stochastic gradient algorithms.


The Impact of Local Geometry and Batch Size on the Convergence and Divergence of Stochastic Gradient Descent

arXiv.org Machine Learning

Stochastic small-batch (SB) methods, such as mini-batch Stochastic Gradient Descent (SGD), have been extremely successful in training neural networks with strong generalization properties. In the work of Keskar et. al (2017), an SB method's success in training neural networks was attributed to the fact it converges to flat minima---those minima whose Hessian has only small eigenvalues---while a large-batch (LB) method converges to sharp minima---those minima whose Hessian has a few large eigenvalues. Commonly, this difference is attributed to the noisier gradients in SB methods that allow SB iterates to escape from sharp minima. While this explanation is intuitive, in this work we offer an alternative mechanism. In this work, we argue that SGD escapes from or converges to minima based on a deterministic relationship between the learning rate, the batch size, and the local geometry of the minimizer. We derive the exact relationships by a rigorous mathematical analysis of the canonical quadratic sums problem. Then, we numerically study how these relationships extend to nonconvex, stochastic optimization problems. As a consequence of this work, we offer a more complete explanation of why SB methods prefer flat minima and LB methods seem agnostic, which can be leveraged to design SB and LB training methods that have tailored optimization properties.


Normalized Direction-preserving Adam

arXiv.org Machine Learning

Optimization algorithms for training deep models not only affects the convergence rate and stability of the training process, but are also highly related to the generalization performance of the models. While adaptive algorithms, such as Adam and RMSprop, have shown better optimization performance than stochastic gradient descent (SGD) in many scenarios, they often lead to worse generalization performance than SGD, when used for training deep neural networks (DNNs). In this work, we identify two problems of Adam that may degrade the generalization performance. As a solution, we propose the normalized direction-preserving Adam (ND-Adam) algorithm, which combines the best of both worlds, i.e., the good optimization performance of Adam, and the good generalization performance of SGD. In addition, we further improve the generalization performance in classification tasks, by using batch-normalized softmax. This study suggests the need for more precise control over the training process of DNNs.


Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

arXiv.org Machine Learning

Most distributed machine learning systems nowadays, including TensorFlow and CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms lies on high communication cost on the central node. Motivated by this, we ask, can decentralized algorithms be faster than its centralized counterpart? Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, simply assuming the application scenario where only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has comparable total computational complexities to C-PSGD but requires much less communication cost on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.


Parle: parallelizing stochastic gradient descent

arXiv.org Machine Learning

We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4x faster than a data-parallel implementation of SGD, while achieving significantly improved error rates that are nearly state-of-the-art on several benchmarks including CIFAR-10 and CIFAR-100, without introducing any additional hyper-parameters. We exploit the phenomenon of flat minima that has been shown to lead to improved generalization error for deep networks. Parle requires very infrequent communication with the parameter server and instead performs more computation on each client, which makes it well-suited to both single-machine, multi-GPU settings and distributed implementations.


Variable Annealing Length and Parallelism in Simulated Annealing

arXiv.org Artificial Intelligence

In this paper, we propose: (a) a restart schedule for an adaptive simulated annealer, and (b) parallel simulated annealing, with an adaptive and parameter-free annealing schedule. The foundation of our approach is the Modified Lam annealing schedule, which adaptively controls the temperature parameter to track a theoretically ideal rate of acceptance of neighboring states. A sequential implementation of Modified Lam simulated annealing is almost parameter-free. However, it requires prior knowledge of the annealing length. We eliminate this parameter using restarts, with an exponentially increasing schedule of annealing lengths. We then extend this restart schedule to parallel implementation, executing several Modified Lam simulated annealers in parallel, with varying initial annealing lengths, and our proposed parallel annealing length schedule. To validate our approach, we conduct experiments on an NP-Hard scheduling problem with sequence-dependent setup constraints. We compare our approach to fixed length restarts, both sequentially and in parallel. Our results show that our approach can achieve substantial performance gains, throughout the course of the run, demonstrating our approach to be an effective anytime algorithm.


Stochastic Gradient Descent: Going As Fast As Possible But Not Faster

arXiv.org Machine Learning

When applied to training deep neural networks, stochastic gradient descent (SGD) often incurs steady progression phases, interrupted by catastrophic episodes in which loss and gradient norm explode. A possible mitigation of such events is to slow down the learning process. This paper presents a novel approach to control the SGD learning rate, that uses two statistical tests. The first one, aimed at fast learning, compares the momentum of the normalized gradient vectors to that of random unit vectors and accordingly gracefully increases or decreases the learning rate. The second one is a change point detection test, aimed at the detection of catastrophic learning episodes; upon its triggering the learning rate is instantly halved. Both abilities of speeding up and slowing down the learning rate allows the proposed approach, called SALeRA, to learn as fast as possible but not faster. Experiments on standard benchmarks show that SALeRA performs well in practice, and compares favorably to the state of the art.


A Convergence Analysis for A Class of Practical Variance-Reduction Stochastic Gradient MCMC

arXiv.org Machine Learning

Stochastic gradient Markov Chain Monte Carlo (SG-MCMC) has been developed as a flexible family of scalable Bayesian sampling algorithms. However, there has been little theoretical analysis of the impact of minibatch size to the algorithm's convergence rate. In this paper, we prove that under a limited computational budget/time, a larger minibatch size leads to a faster decrease of the mean squared error bound (thus the fastest one corresponds to using full gradients), which motivates the necessity of variance reduction in SG-MCMC. Consequently, by borrowing ideas from stochastic optimization, we propose a practical variance-reduction technique for SG-MCMC, that is efficient in both computation and storage. We develop theory to prove that our algorithm induces a faster convergence rate than standard SG-MCMC. A number of large-scale experiments, ranging from Bayesian learning of logistic regression to deep neural networks, validate the theory and demonstrate the superiority of the proposed variance-reduction SG-MCMC framework.


Contouring learning rate to optimize neural nets

#artificialintelligence

Check out Siddha Ganju's talk on embedded deep learning at the Artificial Intelligence Conference in San Francisco, Sept. 17-20, 2017. Learning rate is the rate at which the accumulation of information in a neural network progresses over time. The learning rate determines how quickly (and whether at all) the network reaches the optimum, most conducive location in the network for the specific output desired. In plain Stochastic Gradient Descent (SGD), the learning rate is not related to the shape of the error gradient because a global learning rate is used, which is independent of the error gradient. However, there are many modifications that can be made to the original SGD update rule that relates the learning rate to the magnitude and orientation of the error gradient.