AITopics | Gradient Descent

Gradient Descent

Gradient Descent For Machine Learning - Machine Learning Mastery

#artificialintelligence, Mar-22-2016, 23:00:42 GMT

Optimization is a big part of machine learning. Almost every machine learning algorithm has an optimization algorithm at its core. In this post you will discover a simple optimization algorithm that you can use with any machine learning algorithm. It is easy to understand and easy to implement.
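As a rough illustration of the idea, the sketch below runs plain gradient descent on a one-dimensional quadratic. The function name `gradient_descent`, the toy objective, the learning rate, and the step count are all assumptions chosen for this example; none of them come from the post.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


def grad_f(x):
    # Gradient of the toy objective f(x) = (x - 3)^2 (hypothetical example).
    return 2.0 * (x - 3.0)


x_min = gradient_descent(grad_f, x0=0.0)
print(x_min)  # close to 3.0, the minimizer of f
```

Each step shrinks the distance to the minimizer by a constant factor here, which is why a fixed learning rate is enough for this simple objective.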

artificial intelligence, coefficient, machine learning, (10 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

Trading-off variance and complexity in stochastic gradient descent

Shah, Vatsal, Asteris, Megasthenis, Kyrillidis, Anastasios, Sanghavi, Sujay

arXiv.org Machine Learning, Mar-22-2016

Stochastic gradient descent is the method of choice for large-scale machine learning problems, by virtue of its light complexity per iteration. However, it lags behind its non-stochastic counterparts with respect to the convergence rate, due to high variance introduced by the stochastic updates. The popular Stochastic Variance-Reduced Gradient (SVRG) method mitigates this shortcoming, introducing a new update rule which requires infrequent passes over the entire input dataset to compute the full gradient. In this work, we propose CheapSVRG, a stochastic variance-reduction optimization scheme. Our algorithm is similar to SVRG but instead of the full gradient, it uses a surrogate which can be efficiently computed on a small subset of the input data. It achieves a linear convergence rate (up to some error level, depending on the nature of the optimization problem) and features a trade-off between the computational complexity and the convergence rate. Empirical evaluation shows that CheapSVRG performs at least competitively compared to the state of the art.
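For orientation, here is a minimal sketch of the standard SVRG update that the abstract takes as its baseline. It is not CheapSVRG: the full-gradient pass is computed over all data points, and the comment marks the step that the abstract says is replaced by a subset-based surrogate. The one-dimensional least-squares objective, the data, the step size, and the loop lengths are assumptions made for this example.

```python
import random


def svrg(a, b, w0=0.0, step=0.01, epochs=20, inner_steps=100, seed=0):
    """Standard SVRG on f(w) = (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(w, i):
        # Gradient of the i-th term 0.5 * (a[i] * w - b[i])**2.
        return a[i] * (a[i] * w - b[i])

    w = w0
    for _ in range(epochs):
        snapshot = w
        # Full pass over the data. Per the abstract, this is the step
        # CheapSVRG replaces with a surrogate computed on a small subset.
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient.
            g = grad_i(w, i) - grad_i(snapshot, i) + full_grad
            w -= step * g
    return w


# Hypothetical toy data; the exact least-squares solution is known in closed form.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.1, 3.9, 6.2, 7.8]
w_svrg = svrg(a, b)
w_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
print(w_svrg, w_star)
```

The correction term `grad_i(w, i) - grad_i(snapshot, i)` vanishes as `w` approaches the snapshot, so the noise of the update shrinks as the iterates converge, which is what permits a constant step size.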

artificial intelligence, cheapsvrg, machine learning, (15 more...)

arXiv:1603.06861

Genre: Research Report (0.83)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Fast Incremental Method for Nonconvex Optimization

Reddi, Sashank J., Sra, Suvrit, Poczos, Barnabas, Smola, Alex

arXiv.org Machine Learning, Mar-19-2016

We analyze a fast incremental aggregated gradient method for optimizing nonconvex problems of the form $\min_x \sum_i f_i(x)$. Specifically, we analyze the SAGA algorithm within an Incremental First-order Oracle framework, and show that it converges to a stationary point provably faster than both gradient descent and stochastic gradient descent. We also discuss a special class of nonconvex problems, due to Polyak, for which SAGA converges at a linear rate to the global optimum. Finally, we analyze the practically valuable regularized and minibatch variants of SAGA. To our knowledge, this paper presents the first analysis of fast convergence for an incremental aggregated gradient method for nonconvex problems.
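The basic SAGA update can be sketched as follows. This is only an illustration on a convex one-dimensional least-squares problem, not the nonconvex setting the paper analyzes; the data, the step size, and the iteration count are assumptions chosen for the example.

```python
import random


def saga(a, b, w0=0.0, step=0.02, iterations=2000, seed=0):
    """SAGA on f(w) = (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(w, i):
        # Gradient of the i-th term 0.5 * (a[i] * w - b[i])**2.
        return a[i] * (a[i] * w - b[i])

    w = w0
    # Table holding the most recently evaluated gradient of every term.
    table = [grad_i(w, i) for i in range(n)]
    table_mean = sum(table) / n
    for _ in range(iterations):
        j = rng.randrange(n)
        new_grad = grad_i(w, j)
        # Fresh gradient, minus its stale copy, plus the average of the table.
        w -= step * (new_grad - table[j] + table_mean)
        # Keep the running average consistent with the updated table entry.
        table_mean += (new_grad - table[j]) / n
        table[j] = new_grad
    return w


# Hypothetical toy data with a closed-form least-squares solution.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.1, 3.9, 6.2, 7.8]
w_saga = saga(a, b)
w_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
print(w_saga, w_star)
```

Unlike SVRG-style methods, this scheme never recomputes a full gradient after initialization; it pays for that with memory for one stored gradient per data point.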

artificial intelligence, convergence rate, machine learning, (17 more...)

arXiv:1603.06159

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.76)

Online Learning to Sample

Bouchard, Guillaume, Trouillon, Théo, Perez, Julien, Gaidon, Adrien

arXiv.org Machine Learning, Mar-15-2016

Stochastic Gradient Descent (SGD) is one of the most widely used techniques for online optimization in machine learning. In this work, we accelerate SGD by adaptively learning how to sample the most useful training examples at each time step. First, we show that SGD can be used to learn the best possible sampling distribution of an importance sampling estimator. Second, we show that the sampling distribution of an SGD algorithm can be estimated online by incrementally minimizing the variance of the gradient. The resulting algorithm, called Adaptive Weighted SGD (AW-SGD), maintains a set of parameters to optimize, as well as a set of parameters to sample learning examples. We show that AW-SGD yields faster convergence in three different applications: (i) image classification with deep features, where the sampling of images depends on their labels, (ii) matrix factorization, where rows and columns are not sampled uniformly, and (iii) reinforcement learning, where the optimized and exploration policies are estimated at the same time and our approach corresponds to an off-policy gradient algorithm.
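The importance sampling estimator behind this idea can be checked on a tiny example. The sketch below is not AW-SGD itself: it uses a fixed, hand-picked sampling distribution rather than one learned online, and the per-example gradient values are hypothetical. It computes the exact mean and variance of the reweighted gradient estimate g_i / (n * p_i) when example i is drawn with probability p_i.

```python
def estimator_moments(grads, probs):
    """Exact mean and variance of g_i / (n * p_i) when i is drawn from probs."""
    n = len(grads)
    values = [g / (n * p) for g, p in zip(grads, probs)]
    mean = sum(p * v for p, v in zip(probs, values))
    variance = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, variance


grads = [1.0, 2.0, 3.0, 10.0]  # hypothetical scalar per-example gradients
n = len(grads)

uniform = [1.0 / n] * n
total = sum(abs(g) for g in grads)
weighted = [abs(g) / total for g in grads]  # sample large gradients more often

mean_u, var_u = estimator_moments(grads, uniform)
mean_w, var_w = estimator_moments(grads, weighted)
print(mean_u, var_u)
print(mean_w, var_w)
```

Both distributions give an unbiased estimate of the average gradient, but sampling in proportion to gradient magnitude lowers the variance, which is the quantity the abstract says is minimized online.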

algorithm, artificial intelligence, machine learning, (14 more...)

arXiv:1506.09016

Country: Europe (0.28)

Genre: Research Report (0.64)

Industry: Education > Educational Setting > Online (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)

A Variational Perspective on Accelerated Methods in Optimization

Wibisono, Andre, Wilson, Ashia C., Jordan, Michael I.

arXiv.org Machine Learning, Mar-14-2016

Accelerated gradient methods play a central role in optimization, achieving optimal rates in many settings. While many generalizations and extensions of Nesterov's original acceleration method have been proposed, it is not yet clear what is the natural scope of the acceleration concept. In this paper, we study accelerated methods from a continuous-time perspective. We show that there is a Lagrangian functional that we call the Bregman Lagrangian which generates a large class of accelerated methods in continuous time, including (but not limited to) accelerated gradient descent, its non-Euclidean extension, and accelerated higher-order gradient methods. We show that the continuous-time limit of all of these methods corresponds to traveling the same curve in spacetime at different speeds. From this perspective, Nesterov's technique and many of its generalizations can be viewed as a systematic way to go from the continuous-time curves generated by the Bregman Lagrangian to a family of discrete-time accelerated algorithms.
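To make the notion of acceleration concrete, the following sketch compares plain gradient descent with a textbook discrete-time Nesterov scheme (the constant-momentum variant for strongly convex problems) on an ill-conditioned quadratic. It does not implement the continuous-time Bregman Lagrangian framework of the paper; the objective, the constants `L` and `mu`, and the step count are assumptions chosen for illustration.

```python
def grad(point):
    # Gradient of the toy objective f(x, y) = 0.5 * (x**2 + 100 * y**2).
    x, y = point
    return (x, 100.0 * y)


def gradient_descent(start, step, steps):
    x, y = start
    for _ in range(steps):
        gx, gy = grad((x, y))
        x, y = x - step * gx, y - step * gy
    return (x, y)


def nesterov(start, step, momentum, steps):
    x, y = start
    prev_x, prev_y = start
    for _ in range(steps):
        # Look-ahead point extrapolated along the previous displacement.
        look_x = x + momentum * (x - prev_x)
        look_y = y + momentum * (y - prev_y)
        gx, gy = grad((look_x, look_y))
        prev_x, prev_y = x, y
        # Gradient step taken from the look-ahead point.
        x, y = look_x - step * gx, look_y - step * gy
    return (x, y)


L, mu = 100.0, 1.0  # largest and smallest curvature of the toy objective
momentum = (L ** 0.5 - mu ** 0.5) / (L ** 0.5 + mu ** 0.5)
gd_point = gradient_descent((1.0, 1.0), 1.0 / L, 200)
agd_point = nesterov((1.0, 1.0), 1.0 / L, momentum, 200)
print(gd_point, agd_point)
```

With the same step size and the same number of gradient evaluations, the accelerated iterate ends up several orders of magnitude closer to the minimizer at the origin, because its contraction rate depends on the square root of the condition number rather than on the condition number itself.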

artificial intelligence, lagrangian, machine learning, (17 more...)

arXiv:1603.04245

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Stochastic dual averaging methods using variance reduction techniques for regularized empirical risk minimization problems

Murata, Tomoya, Suzuki, Taiji

arXiv.org Machine Learning, Mar-8-2016

We consider a composite convex minimization problem associated with regularized empirical risk minimization, which often arises in machine learning. We propose two new stochastic gradient methods that are based on the stochastic dual averaging method with variance reduction. Our methods generate a sparser solution than the existing methods because we do not need to take the average of the history of the solutions. This is favorable in terms of both interpretability and generalization. Moreover, our methods have theoretical support for both a strongly and a non-strongly convex regularizer and achieve the best known convergence rates among existing nonaccelerated stochastic gradient methods.

algorithm, convergence rate, convex regularizer, (8 more...)

arXiv:1603.02412

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.58)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.57)

Gradient Descent Converges to Minimizers

Lee, Jason D., Simchowitz, Max, Jordan, Michael I., Recht, Benjamin

arXiv.org Machine Learning, Mar-4-2016

Saddle points have long been regarded as a tremendous obstacle for continuous optimization. There are many well-known examples in which worst-case initialization of gradient descent provably converges to saddle points [20, Section 1.2.3], and hardness results which show that finding even a local minimizer of nonconvex functions is NP-Hard in the worst case [19]. However, such worst-case analyses have not daunted practitioners, and high-quality solutions of continuous optimization problems are readily found by a variety of simple algorithms. Building on tools from the theory of dynamical systems, this paper demonstrates that, under very mild regularity conditions, saddle points are indeed of little concern for the gradient method.
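The contrast between worst-case and typical initialization can be seen on a small example. The sketch below is a toy demonstration, not the paper's analysis: the objective f(x, y) = 0.5 x^2 + 0.25 y^4 - 0.5 y^2 (a saddle at the origin, minimizers at (0, 1) and (0, -1)), the step size, and the starting points are assumptions chosen for illustration.

```python
def grad(x, y):
    # Gradient of the toy objective f(x, y) = 0.5 * x**2 + 0.25 * y**4 - 0.5 * y**2.
    return x, y ** 3 - y


def run_gradient_descent(x, y, step=0.1, steps=500):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y


# Worst-case start: exactly on the line y = 0, which leads into the saddle.
on_manifold = run_gradient_descent(1.0, 0.0)

# Slightly perturbed start: the iterates leave the saddle and reach a minimizer.
perturbed = run_gradient_descent(1.0, 1e-6)

print(on_manifold)  # stays on y = 0 and approaches the saddle at (0, 0)
print(perturbed)    # approaches the minimizer at (0, 1)
```

Only starting points with y exactly equal to zero are drawn into the saddle here; that set is a line in the plane, so a randomly chosen starting point avoids it with probability one.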

artificial intelligence, critical point, machine learning, (17 more...)

arXiv:1602.04915

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

A Structured Variational Auto-encoder for Learning Deep Hierarchies of Sparse Features

Salimans, Tim

arXiv.org Machine Learning, Feb-28-2016

In this note we present a generative model of natural images consisting of a deep hierarchy of layers of latent random variables, each of which follows a new type of distribution that we call rectified Gaussian. These rectified Gaussian units allow spike-and-slab type sparsity, while retaining the differentiability necessary for efficient stochastic gradient variational inference. To learn the parameters of the new model, we approximate the posterior of the latent variables with a variational auto-encoder. Rather than making the usual mean-field assumption, however, the encoder parameterizes a new type of structured variational approximation that retains the prior dependencies of the generative model. Using this structured posterior approximation, we are able to perform joint training of deep models with many layers of latent random variables, without having to resort to stacking or other layerwise training procedures.

artificial intelligence, machine learning, variational lower bound, (13 more...)

arXiv:1602.08734

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Variance Reduced Stochastic Gradient Descent with Neighbors

Hofmann, Thomas, Lucchi, Aurelien, Lacoste-Julien, Simon, McWilliams, Brian

arXiv.org Machine Learning, Feb-26-2016

Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet its slow convergence can be a computational bottleneck. Variance reduction techniques such as SAG, SVRG and SAGA have been proposed to overcome this weakness, achieving linear convergence. However, these methods are either based on computations of full gradients at pivot points, or on keeping per data point corrections in memory. Therefore, speedups relative to SGD may need a minimal number of epochs in order to materialize. This paper investigates algorithms that can exploit neighborhood structure in the training data to share and reuse information about past stochastic gradients across data points, which offers advantages in the transient optimization phase. As a side-product we provide a unified convergence analysis for a family of variance reduction algorithms, which we call memorization algorithms. We provide experimental results supporting our theory.

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv:1506.03662

Country: Europe > Switzerland > Zürich > Zürich (0.44)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Practical Riemannian Neural Networks

Marceau-Caron, Gaétan, Ollivier, Yann

arXiv.org Machine Learning, Feb-25-2016

We provide the first experimental results on nonsynthetic datasets for the quasi-diagonal Riemannian gradient descents for neural networks introduced in [Oll15]. These include the MNIST, SVHN, and FACE datasets as well as a previously unpublished electroencephalogram dataset. The quasi-diagonal Riemannian algorithms consistently beat simple stochastic gradient descents by a varying margin. The computational overhead with respect to simple backpropagation is around a factor of 2. Perhaps more interestingly, these methods also reach their final performance quickly, thus requiring fewer training epochs and a smaller total computation time. We also present an implementation guide to these Riemannian gradient descents for neural networks, showing how the quasi-diagonal versions can be implemented with minimal effort on top of existing routines which compute gradients. We present a practical and efficient implementation of invariant stochastic gradient descent algorithms for neural networks based on the quasi-diagonal Riemannian metrics introduced in [Oll15]. These can be implemented from the same data as RMSProp- or AdaGrad-based schemes [DHS11], namely, by collecting gradients and squared gradients for each data sample. Thus we will try to present them in a way that can easily be incorporated on top of existing software providing gradients for neural networks.
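For reference, the kind of bookkeeping an AdaGrad-based scheme performs, accumulating squared gradients alongside the gradients themselves, can be sketched as below. This is a plain AdaGrad update on a one-dimensional toy objective, not the quasi-diagonal Riemannian method of the paper; the objective, learning rate, and step count are assumptions for illustration. RMSProp differs in that it replaces the running sum with an exponential moving average of squared gradients.

```python
import math


def adagrad(grad, w0, learning_rate=1.0, steps=200, eps=1e-8):
    """AdaGrad for a scalar parameter: scale steps by accumulated squared gradients."""
    w = w0
    sum_sq = 0.0  # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        sum_sq += g * g
        # The effective step size shrinks as squared gradients accumulate.
        w -= learning_rate * g / (math.sqrt(sum_sq) + eps)
    return w


# Hypothetical toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_ada = adagrad(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_ada)  # close to 3.0
```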

artificial intelligence, gradient, machine learning, (18 more...)

arXiv:1602.08007

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
