AITopics | Gradient Descent

Gradient Descent

Stable Robbins-Monro approximations through stochastic proximal updates

Toulis, Panos, Horel, Thibaut, Airoldi, Edoardo M.

arXiv.org Machine Learning, Mar-4-2018

The need for parameter estimation with massive data has reinvigorated interest in iterative estimation procedures. Stochastic approximations, such as stochastic gradient descent, are at the forefront of this recent development because they yield simple, generic, and extremely fast iterative estimation procedures. Such stochastic approximations, however, are often numerically unstable. As a consequence, current practice has turned to proximal operators, which can induce stable parameter updates within iterations. While the majority of classical iterative estimation procedures are subsumed by the framework of Robbins and Monro (1951), there is no such generalization for stochastic approximations with proximal updates. In this paper, we conceptualize a general stochastic approximation method with proximal updates. This method can be applied even in situations where the analytical form of the objective is not known, and so it generalizes many stochastic gradient procedures with proximal operators currently in use. Our theoretical analysis indicates that the proposed method has important stability benefits over the classical stochastic approximation method. Exact instantiations of the proposed method are challenging, but we show that approximate instantiations lead to procedures that are easy to implement, and still dominate classical procedures by achieving numerical stability without tradeoffs. This last advantage is akin to that seen in deterministic proximal optimization, where the framework is typically impossible to instantiate exactly, but where approximate instantiations lead to new optimization procedures that dominate classical ones.
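As a rough illustration of the stability claim (a minimal sketch, not the authors' general method): for a one-dimensional least-squares model the proximal, or implicit, stochastic update has a closed form, so it can be compared directly with the classical explicit update. The data model, the step size `gamma`, and all function names below are assumptions chosen for this example.

```python
import random

def explicit_step(theta, x, y, gamma):
    # Classical Robbins-Monro / SGD step on the loss 0.5 * (y - x * theta)**2.
    return theta + gamma * (y - x * theta) * x

def proximal_step(theta, x, y, gamma):
    # Implicit step: solves u = theta + gamma * (y - x * u) * x for u,
    # which is the proximal operator of gamma * loss evaluated at theta.
    return (theta + gamma * x * y) / (1.0 + gamma * x * x)

def run(step, gamma=2.0, n=50, seed=0):
    # Same seed for both methods, so they see the same data stream.
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(n):
        x = rng.uniform(1.5, 2.0)
        y = 2.0 * x + rng.uniform(-0.1, 0.1)  # hypothetical true parameter: 2.0
        theta = step(theta, x, y, gamma)
    return theta

theta_explicit = run(explicit_step)
theta_proximal = run(proximal_step)
```

With this deliberately large step size the explicit iterate grows without bound, while the proximal iterate stays close to the true parameter, because the implicit update shrinks the error by a factor of 1 / (1 + gamma * x**2) at every step regardless of how large `gamma` is.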

artificial intelligence, machine learning, procedure, (16 more...)

arXiv:1510.00967

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.77)

Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors

Dziugaite, Gintare Karolina, Roy, Daniel M.

arXiv.org Machine Learning, Mar-3-2018

We show that Entropy-SGD (Chaudhari et al., 2017), when viewed as a learning algorithm, optimizes a PAC-Bayes bound on the risk of a Gibbs (posterior) classifier, i.e., a randomized classifier obtained by a risk-sensitive perturbation of the weights of a learned classifier. Entropy-SGD works by optimizing the bound's prior, violating the hypothesis of the PAC-Bayes theorem that the prior is chosen independently of the data. Indeed, available implementations of Entropy-SGD rapidly obtain zero training error on random labels and the same holds of the Gibbs posterior. In order to obtain a valid generalization bound, we rely on a result showing that data-dependent priors obtained by stochastic gradient Langevin dynamics (SGLD) yield valid PAC-Bayes bounds provided the target distribution of SGLD is $\epsilon$-differentially private. We observe that test error on MNIST and CIFAR10 falls within the (empirically nonvacuous) risk bounds computed under the assumption that SGLD reaches stationarity. In particular, Entropy-SGLD can be configured to yield relatively tight generalization bounds and still fit real labels, although these same settings do not obtain state-of-the-art performance.

artificial intelligence, machine learning, pac-bayes, (16 more...)

arXiv:1712.09376

Country:

Europe (0.67)
North America > United States (0.46)
North America > Canada > Ontario (0.28)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD

Dutta, Sanghamitra, Joshi, Gauri, Ghosh, Soumyadip, Dube, Parijat, Nagpurkar, Priya

arXiv.org Machine Learning, Mar-3-2018

Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in waiting for the slowest learners (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect convergence. In this work, we present the first theoretical characterization of the speed-up offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wall-clock time). The novelty in our work is that our runtime analysis considers random straggler delays, which helps us design and compare distributed SGD algorithms that strike a balance between stragglers and staleness. We also present a new convergence analysis of asynchronous SGD variants without bounded or exponential delay assumptions.
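The effect of gradient staleness on convergence can be seen in a toy deterministic setting. The sketch below is an assumption-laden illustration, not the paper's stochastic straggler model: it minimizes f(x) = 0.5 * x**2 while applying, at each step, the gradient computed at an iterate from `staleness` steps earlier. The function name and parameter values are hypothetical.

```python
def delayed_gradient_descent(lr, staleness, steps=200, x0=1.0):
    # The gradient of f(x) = 0.5 * x**2 is x itself. Each update uses the
    # gradient evaluated at the iterate from `staleness` steps in the past.
    history = [x0] * (staleness + 1)
    for _ in range(steps):
        stale_x = history[-1 - staleness]
        history.append(history[-1] - lr * stale_x)
    return history

fresh = delayed_gradient_descent(lr=0.9, staleness=0)
stale = delayed_gradient_descent(lr=0.9, staleness=2)
stale_small_lr = delayed_gradient_descent(lr=0.3, staleness=2)
```

In this toy problem a step size that converges quickly with fresh gradients produces growing oscillations once the gradient is two steps stale, and a smaller step size restores convergence. That is one simple way to see why stragglers and staleness have to be traded off against each other.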

artificial intelligence, learner, machine learning, (16 more...)

arXiv:1803.01113

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

On the diffusion approximation of nonconvex stochastic gradient descent

Hu, Wenqing, Li, Chris Junchi, Li, Lei, Liu, Jian-Guo

arXiv.org Machine Learning, Mar-3-2018

We study the Stochastic Gradient Descent (SGD) method in nonconvex optimization problems from the point of view of approximating diffusion processes. We prove rigorously that the diffusion process can approximate the SGD algorithm weakly using the weak form of the master equation for probability evolution. In the small step size regime and in the presence of omnidirectional noise, our weak approximating diffusion process suggests the following dynamics for the SGD iteration starting from a local minimizer (resp. saddle point): it escapes in a number of iterations exponentially (resp. almost linearly) dependent on the inverse step size. The results are obtained using the theory for random perturbations of dynamical systems (theory of large deviations for local minimizers and theory of exiting for unstable stationary points). In addition, we discuss the effects of batch size for deep neural networks, and we find that a small batch size helps SGD algorithms escape unstable stationary points and sharp minimizers. Our theory indicates that one should increase the batch size at a later stage so that SGD is trapped in flat minimizers for better generalization.

approximation, artificial intelligence, machine learning, (17 more...)

arXiv:1705.07562

Country: North America > United States (1.00)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

MINE: Mutual Information Neural Estimation

Belghazi, Mohamed Ishmael, Baratin, Aristide, Rajeswar, Sai, Ozair, Sherjil, Bengio, Yoshua, Courville, Aaron, Hjelm, R Devon

arXiv.org Machine Learning, Mar-2-2018

We argue that the estimation of mutual information between high dimensional continuous random variables can be achieved by gradient descent over neural networks. We present a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size, trainable through back-prop, and strongly consistent. We present a handful of applications on which MINE can be used to minimize or maximize mutual information. We apply MINE to improve adversarially trained generative models. We also use MINE to implement Information Bottleneck, applying it in tasks related to supervised classification; our results demonstrate substantial improvement in flexibility and performance in these settings.

artificial intelligence, machine learning, mutual information, (12 more...)

arXiv:1801.04062

Genre: Research Report > New Finding (0.86)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Random Hinge Forest for Differentiable Learning

Lay, Nathan, Harrison, Adam P., Schreiber, Sharon, Dawer, Gitesh, Barbu, Adrian

arXiv.org Machine Learning, Mar-1-2018

We propose random hinge forests, a simple, efficient, and novel variant of decision forests. Importantly, random hinge forests can be readily incorporated as a general component within arbitrary computation graphs that are optimized end-to-end with stochastic gradient descent or variants thereof. We derive random hinge forest and ferns, focusing on their sparse and efficient nature, their min-max margin property, strategies to initialize them for arbitrary network architectures, and the class of optimizers most suitable for optimizing random hinge forest. The performance and versatility of random hinge forests are demonstrated by experiments incorporating a variety of small and large UCI machine learning data sets and also ones involving the MNIST, Letter, and USPS image datasets. We compare random hinge forests with random forests and the more recent backpropagating deep neural decision forests.

artificial intelligence, machine learning, random hinge forest, (14 more...)

arXiv:1802.03882

Country: North America > United States (0.35)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

NEON+: Accelerated Gradient Methods for Extracting Negative Curvature for Non-Convex Optimization

Xu, Yi, Jin, Rong, Yang, Tianbao

arXiv.org Machine Learning, Mar-1-2018

Accelerated gradient (AG) methods are breakthroughs in convex optimization, improving the convergence rate of the gradient descent method for optimization with smooth functions. However, the analysis of AG methods for non-convex optimization is still limited. It remains an open question whether AG methods from convex optimization can accelerate the convergence of the gradient descent method for finding a local minimum of non-convex optimization problems. This paper provides an affirmative answer to this question. In particular, we analyze two renowned variants of AG methods (namely Polyak's Heavy Ball method and Nesterov's Accelerated Gradient method) for extracting the negative curvature from random noise, which is central to escaping from saddle points. By leveraging the proposed AG methods for extracting the negative curvature, we present a new AG algorithm with double loops for non-convex optimization, which converges to a second-order stationary point $x$ such that $\|\nabla f(x)\| \leq \epsilon$ and $\nabla^2 f(x) \succeq -\sqrt{\epsilon} I$ with $\widetilde{O}(1/\epsilon^{1.75})$ iteration complexity, improving that of the gradient descent method by a factor of $\epsilon^{-0.25}$ and matching the best iteration complexity of second-order Hessian-free methods for non-convex optimization. (Footnote: this is in contrast to a single-loop AG algorithm proposed in a recent manuscript [AGNON], which directly analyzed Nesterov's AG method for non-convex optimization and appeared online on November 29, 2017. However, we emphasize that our work is independent; it is inspired by our earlier work [NEON17] and is based on a different novel analysis.)
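To give some intuition for why momentum helps with negative curvature, the following sketch (a toy illustration under stated assumptions, not the NEON+ procedure) tracks a single coordinate along the negative-curvature direction of the saddle f(x, y) = 0.5 * x**2 - 0.5 * delta * y**2. A tiny deterministic offset `y0` stands in for random noise, and the function name, step size, and escape radius are all hypothetical choices.

```python
def steps_to_escape(momentum, lr=0.1, delta=0.1, y0=1e-6, radius=1.0,
                    max_steps=100000):
    # Along the negative-curvature direction, df/dy = -delta * y.
    y_prev, y = y0, y0
    for t in range(1, max_steps + 1):
        grad = -delta * y
        # Polyak heavy-ball step; momentum=0.0 reduces it to gradient descent.
        y_next = y - lr * grad + momentum * (y - y_prev)
        y_prev, y = y, y_next
        if abs(y) > radius:
            return t  # number of steps needed to leave the saddle region
    return max_steps

gd_steps = steps_to_escape(momentum=0.0)
hb_steps = steps_to_escape(momentum=0.9)
```

Both iterations amplify the small offset geometrically, but the heavy-ball recursion has a strictly larger growth factor than the plain gradient step, so `hb_steps` comes out smaller than `gd_steps` in this setting.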

artificial intelligence, machine learning, optimization, (13 more...)

arXiv:1712.01033

Country: North America > United States > Iowa (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)

On the Sublinear Convergence of Randomly Perturbed Alternating Gradient Descent to Second Order Stationary Solutions

Lu, Songtao, Hong, Mingyi, Wang, Zhengdao

arXiv.org Machine Learning, Feb-28-2018

Alternating gradient descent (AGD) is a simple but popular algorithm that has been applied to problems in optimization, machine learning, data mining, and signal processing. The algorithm updates two blocks of variables in an alternating manner, in which a gradient step is taken on one block while keeping the remaining block fixed. When the objective function is nonconvex, it is well known that AGD converges to a first-order stationary solution with a global sublinear rate. In this paper, we show that a variant of AGD-type algorithms will not be trapped by "bad" stationary solutions such as saddle points and local maximum points. In particular, we consider a smooth unconstrained optimization problem, and propose a perturbed AGD (PA-GD) which converges (with high probability) to the set of second-order stationary solutions (SS2) with a global sublinear rate. To the best of our knowledge, this is the first alternating type algorithm which takes $\mathcal{O}(\text{polylog}(d)/\epsilon^{7/3})$ iterations to achieve SS2 with high probability, where $\text{polylog}(d)$ denotes a polynomial in the logarithm of the problem dimension $d$.
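The alternating update and the role of the perturbation can be sketched on a two-variable example. This is a simplified illustration, not the authors' PA-GD: the objective f(x, y) = 0.5 * (x * y - 1)**2, which has a strict saddle at the origin, is an assumption chosen for the example, and the perturbation rule is reduced to a single random kick the first time the gradient vanishes.

```python
import math
import random

def f(x, y):
    return 0.5 * (x * y - 1.0) ** 2

def grad_x(x, y):
    return y * (x * y - 1.0)

def grad_y(x, y):
    return x * (x * y - 1.0)

def alternating_gd(x, y, step=0.1, iters=2000, radius=0.0,
                   grad_tol=1e-8, seed=0):
    rng = random.Random(seed)
    perturbed = False
    for _ in range(iters):
        stuck = math.hypot(grad_x(x, y), grad_y(x, y)) < grad_tol
        if radius > 0.0 and stuck and not perturbed:
            # Simplification: one random kick at the first stationary point.
            x += rng.uniform(-radius, radius)
            y += rng.uniform(-radius, radius)
            perturbed = True
        x -= step * grad_x(x, y)  # gradient step on block x, y held fixed
        y -= step * grad_y(x, y)  # gradient step on block y, using the new x
    return x, y

plain = alternating_gd(0.0, 0.0)
kicked = alternating_gd(0.0, 0.0, radius=0.01)
```

Started exactly at the saddle, the unperturbed iteration never moves, since both block gradients are zero there. The perturbed run picks up a component along the negative-curvature direction and then descends to a point with x * y close to 1, where the objective is essentially zero.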

artificial intelligence, machine learning, saddle point, (19 more...)

arXiv:1802.10418

Country: North America > United States (0.27)

Genre: Research Report (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)

When is a Convolutional Filter Easy To Learn?

Du, Simon S., Lee, Jason D., Tian, Yuandong

arXiv.org Machine Learning, Feb-28-2018

We analyze the convergence of the (stochastic) gradient descent algorithm for learning a convolutional filter with the Rectified Linear Unit (ReLU) activation function. Our analysis does not rely on any specific form of the input distribution and our proofs only use the definition of ReLU, in contrast with previous works that are restricted to standard Gaussian input. We show that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time and the convergence rate depends on the smoothness of the input distribution and the closeness of patches. To the best of our knowledge, this is the first recovery guarantee of gradient-based algorithms for a convolutional filter on non-Gaussian input distributions. Our theory also justifies the two-stage learning rate strategy in deep neural networks. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.

artificial intelligence, machine learning, neural network, (16 more...)

arXiv:1709.06129

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

The Policy of Truth

#artificialintelligence, Feb-27-2018, 12:18:23 GMT

This is the sixth part of "An Outsider's Tour of Reinforcement Learning." Our first generic candidate for solving reinforcement learning is Policy Gradient. I find it shocking that Policy Gradient wasn't ruled out as a bad idea in 1993. Policy gradient is seductive as it apparently lets one fine tune a program to solve any problem without any domain knowledge. Of course, anything that makes such a claim must be too general for its own good.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.30)
