 Gradient Descent


Dyna: A Method of Momentum for Stochastic Optimization

arXiv.org Machine Learning

An algorithm is presented for momentum gradient descent optimization based on the first-order differential equation of Newtonian dynamics. A fictitious mass is introduced into the momentum dynamics to regularize the adaptive stepsize of each individual parameter. Dynamic relaxation is adapted for stochastic optimization of nonlinear objective functions through explicit time integration with a varying damping ratio. The adaptive stepsize is optimized for each individual neural network layer based on its number of inputs. The adaptive stepsize for every parameter over the entire neural network is uniformly optimized with one upper bound, independent of sparsity, for a better overall convergence rate. The numerical implementation of the algorithm is similar to that of the Adam optimizer, with comparable computational efficiency and memory requirements. The algorithm has three hyper-parameters, each with a clear physical interpretation. Preliminary trials show promise in performance and convergence.
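The physical picture in this abstract (a mass with damping, advanced by explicit time integration) can be illustrated with classical damped-particle momentum descent. The sketch below is not the Dyna algorithm: it uses a constant damping coefficient and a single global stepsize, and the names `damped_momentum_descent`, `mass`, `damping`, and `dt`, as well as the quadratic test objective, are assumptions made for illustration.

```python
import numpy as np

def damped_momentum_descent(grad, x0, mass=1.0, damping=0.5, dt=0.1, steps=1000):
    """Integrate mass * dv/dt = -grad(x) - damping * v, dx/dt = v (illustrative only)."""
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        # Semi-implicit Euler: update the velocity from the net force first,
        # then move the position with the updated velocity.
        v = v + dt * (-grad(x) - damping * v) / mass
        x = x + dt * v
    return x

# Hypothetical test objective: f(x) = 0.5 * x^T A x, minimized at the origin.
A = np.diag([1.0, 10.0])
x_final = damped_momentum_descent(lambda x: A @ x, x0=[3.0, -2.0])
print(np.linalg.norm(x_final))
```

A larger `mass` makes the velocity respond more slowly to the gradient, which is the sense in which a fictitious mass can act as a regularizer on the effective stepsize.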


Predictive Uncertainty in Large Scale Classification using Dropout - Stochastic Gradient Hamiltonian Monte Carlo

arXiv.org Machine Learning

Predictive uncertainty is crucial for many computer vision tasks, from image classification to autonomous driving systems. Hamiltonian Monte Carlo (HMC) is an inference method for sampling complex posterior distributions. On the other hand, Dropout regularization has been proposed as an approximate model averaging technique that tends to improve generalization in large scale models such as deep neural networks. Although HMC provides convergence guarantees for most standard Bayesian models, it does not handle discrete parameters arising from Dropout regularization. In this paper, we present a robust methodology for predictive uncertainty in large scale classification problems, based on Dropout and Stochastic Gradient Hamiltonian Monte Carlo. Even though Dropout induces a non-smooth energy function with no such convergence guarantees, the resulting discretization of the Hamiltonian proves empirically successful. The proposed method makes it possible to effectively estimate predictive accuracy and provides better generalization for difficult test examples.


Randomized Smoothing SVRG for Large-scale Nonsmooth Convex Optimization

arXiv.org Machine Learning

In this paper, we consider the problem of minimizing the average of a large number of nonsmooth convex functions. Such problems often arise in machine learning as empirical risk minimization, but are computationally very challenging. We develop and analyze a new algorithm that achieves a robust linear convergence rate, and both its time complexity and gradient complexity are superior to those of state-of-the-art nonsmooth algorithms and subgradient-based schemes. Moreover, our algorithm works without any extra error bound conditions on the objective function and without the common strong-convexity condition. We show that our algorithm has wide applications in optimization and machine learning problems, and demonstrate experimentally that it performs well on a large-scale ranking problem.


Scaling limit of the Stein variational gradient descent part I: the mean field regime

arXiv.org Machine Learning

We study an interacting particle system in $\mathbf{R}^d$ motivated by Stein variational gradient descent [Q. Liu and D. Wang, NIPS 2016], a deterministic algorithm for sampling from a given probability density with unknown normalization. We prove that in the large particle limit the empirical measure converges to a solution of a non-local and nonlinear PDE. We also prove global well-posedness and uniqueness of the solution to the limiting PDE. Finally, we prove that the solution to the PDE converges to the unique invariant solution in the large time limit.
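For readers unfamiliar with the particle system in question, the following is a minimal sketch of one Stein variational gradient descent update, in which each particle moves along a kernel-weighted average of the score plus a repulsive kernel-gradient term. It assumes an RBF kernel with a fixed bandwidth (rather than an adaptive choice), and the function name `svgd_step`, the Gaussian target, and all parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def svgd_step(particles, grad_log_p, step_size=0.1, bandwidth=1.0):
    """One SVGD update with the RBF kernel k(x, y) = exp(-||x - y||^2 / bandwidth)."""
    n = particles.shape[0]
    diffs = particles[:, None, :] - particles[None, :, :]  # diffs[i, j] = x_i - x_j
    K = np.exp(-np.sum(diffs ** 2, axis=-1) / bandwidth)   # K[i, j] = k(x_i, x_j)
    scores = grad_log_p(particles)                         # shape (n, d)
    # Driving term: sum_j k(x_j, x_i) * grad log p(x_j).
    drive = K @ scores
    # Repulsive term: sum_j grad_{x_j} k(x_j, x_i) = (2 / bandwidth) * sum_j (x_i - x_j) K[i, j].
    repulse = (2.0 / bandwidth) * np.sum(K[:, :, None] * diffs, axis=1)
    return particles + step_size * (drive + repulse) / n

# Assumed target: a standard normal in 2D. Only the score -x is needed,
# so the normalizing constant of the density never appears.
rng = np.random.default_rng(0)
particles = rng.normal(loc=3.0, scale=1.0, size=(50, 2))
for _ in range(1000):
    particles = svgd_step(particles, lambda x: -x)
print(particles.mean(axis=0), particles.std(axis=0))
```

The update is deterministic given the initial particles; the randomness in the example comes only from the initialization.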


Metatrace: Online Step-size Tuning by Meta-gradient Descent for Reinforcement Learning Control

arXiv.org Artificial Intelligence

Reinforcement learning (RL) has had many successes in both "deep" and "shallow" settings. In both cases, significant hyperparameter tuning is often required to achieve good performance. Furthermore, when nonlinear function approximation is used, non-stationarity in the state representation can lead to learning instability. A variety of techniques exist to combat this --- most notably large experience replay buffers or the use of multiple parallel actors. These techniques come at the cost of moving away from the online RL problem as it is traditionally formulated (i.e., a single agent learning online without maintaining a large database of training examples). Meta-learning can potentially help with both these issues by tuning hyperparameters online and allowing the algorithm to more robustly adjust to non-stationarity in a problem. This paper applies meta-gradient descent to derive a set of step-size tuning algorithms specifically for online RL control with eligibility traces. Our novel technique, Metatrace, makes use of an eligibility trace analogous to methods like $TD(\lambda)$. We explore tuning both a single scalar step-size and a separate step-size for each learned parameter. We evaluate Metatrace first for control with linear function approximation in the classic mountain car problem and then in a noisy, non-stationary version. Finally, we apply Metatrace for control with nonlinear function approximation in 5 games in the Arcade Learning Environment where we explore how it impacts learning speed and robustness to initial step-size choice. Results show that the meta-step-size parameter of Metatrace is easy to set, Metatrace can speed learning, and Metatrace can allow an RL algorithm to deal with non-stationarity in the learning task.


Differential Equations for Modeling Asynchronous Algorithms

arXiv.org Machine Learning

Asynchronous stochastic gradient descent (ASGD) is a popular parallel optimization algorithm in machine learning. Most theoretical analyses of ASGD take a discrete view and prove upper bounds on its convergence rate. However, the discrete view has intrinsic limitations: there is no characterization of the optimization path, and the proof techniques are induction-based and thus usually complicated. Inspired by the recent successful adoption of stochastic differential equations (SDE) in the theoretical analysis of SGD, in this paper we study the continuous approximation of ASGD using stochastic differential delay equations (SDDE). We introduce the approximation method and study the approximation error. We then conduct a theoretical analysis of the convergence rates of the ASGD algorithm based on the continuous approximation. Two methods, moment estimation and energy function minimization, can be used to analyze the convergence rates. Moment estimation depends on the specific form of the loss function, while energy function minimization only leverages the convexity of the loss function and does not depend on its specific form. In addition to the convergence analysis, the continuous view also helps us derive better convergence rates. Together, these results show the advantage of taking the continuous view in gradient descent algorithms.
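The delay that motivates a delay-equation model can be seen in a small simulation. The sketch below uses a common simplified model of ASGD, in which each update applies a noisy gradient evaluated at an iterate that is a fixed number of steps old; this fixed-delay recursion, the function name `delayed_sgd`, and the quadratic loss are assumptions for illustration and are not taken from the paper.

```python
import numpy as np
from collections import deque

def delayed_sgd(grad, x0, lr=0.05, delay=4, noise_std=0.01, steps=2000, seed=0):
    """Simulate x_{k+1} = x_k - lr * (grad(x_{k - delay}) + noise)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    # The buffer holds the last delay + 1 iterates; its oldest entry is x_{k - delay}.
    history = deque([x.copy()] * (delay + 1), maxlen=delay + 1)
    for _ in range(steps):
        stale = history[0]
        g = grad(stale) + noise_std * rng.standard_normal(x.shape)
        x = x - lr * g
        history.append(x.copy())  # the oldest iterate is dropped automatically
    return x

# Hypothetical loss f(x) = 0.5 * ||x||^2, whose gradient is x.
x_final = delayed_sgd(lambda x: x, x0=[2.0, -1.0])
print(x_final)
```

Raising `delay` or `lr` far enough makes this recursion oscillate and diverge even on a convex quadratic, which is the kind of behavior a continuous-time analysis with delay is meant to capture.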


Polynomial Convergence of Gradient Descent for Training One-Hidden-Layer Neural Networks

arXiv.org Machine Learning

We analyze Gradient Descent applied to learning a bounded target function on $n$ real-valued inputs by training a neural network with a single hidden layer of nonlinear gates. Our main finding is that GD starting from a randomly initialized network converges in mean squared loss to the minimum error (in 2-norm) of the best approximation of the target function using a polynomial of degree at most $k$. Moreover, the size of the network and number of iterations needed are both bounded by $n^{O(k)}$. The core of our analysis is the following existence theorem, which is of independent interest: for any $\epsilon > 0$, any bounded function that has a degree-$k$ polynomial approximation with error $\epsilon_0$ (in 2-norm), can be approximated to within error $\epsilon_0 + \epsilon$ as a linear combination of $n^{O(k)} \mbox{poly}(1/\epsilon)$ randomly chosen gates from any class of gates whose corresponding activation function has nonzero coefficients in its harmonic expansion for degrees up to $k$. In particular, this applies to training networks of unbiased sigmoids and ReLUs.


Wavelet Decomposition of Gradient Boosting

arXiv.org Machine Learning

In this paper we introduce a significant improvement to the popular tree-based Stochastic Gradient Boosting algorithm using a wavelet decomposition of the trees. This approach is based on elements of harmonic analysis and approximation theory, and as we show through extensive experimentation, our wavelet-based method generally outperforms existing methods, particularly in difficult scenarios of class imbalance and mislabeling in the training data.


AI researchers allege that machine learning is alchemy

#artificialintelligence

Gradient descent relies on trial and error to optimize an algorithm, aiming for minima in a 3D landscape. Ali Rahimi, a researcher in artificial intelligence (AI) at Google in San Francisco, California, took a swipe at his field last December--and received a 40-second ovation for it. Speaking at an AI conference, Rahimi charged that machine learning algorithms, in which computers learn through trial and error, have become a form of "alchemy." Researchers, he said, do not know why some algorithms work and others don't, nor do they have rigorous criteria for choosing one AI architecture over another. Now, in a paper presented on 30 April at the International Conference on Learning Representations in Vancouver, Canada, Rahimi and his collaborators document examples of what they see as the alchemy problem and offer prescriptions for bolstering AI's rigor.


Training Data for Machine Learning Algorithms Done Right!

#artificialintelligence

There are lots of differences between traditional statistical modeling and machine learning. Machine learning depends on computer algorithms, few estimates, and small to large datasets to produce predictions of high accuracy. Statistical modeling, on the other hand, depends on human talents, mathematical equations, and many assumptions to predict the "best estimate." The amount of training data needed to build a comprehensive machine learning model is highly debated. The type of data determines the number of features.