AITopics

arXiv.org Machine Learning Apr-4-2016

Stochastic Variance Reduction for Nonconvex Optimization

Reddi, Sashank J., Hefny, Ahmed, Sra, Suvrit, Poczos, Barnabas, Smola, Alex

We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
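The variance-reduced update the abstract refers to can be illustrated with a minimal sketch. This is not the authors' code: the scalar least-squares problem, the step size, and the names `svrg` and `grad_i` are all assumptions chosen for illustration, and the toy objective is convex (so the answer is easy to check), whereas the paper treats the nonconvex case.

```python
import random


def svrg(grad_i, n, w0, step, epochs, inner_steps, seed=0):
    """Minimal SVRG loop for a scalar parameter w (illustrative sketch)."""
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs):
        snapshot = w
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(i, snapshot) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced estimate: still unbiased for the full gradient
            # at w, but with less noise than the plain SGD term grad_i(i, w).
            g = grad_i(i, w) - grad_i(i, snapshot) + full_grad
            w -= step * g
    return w


# Hypothetical toy data: y = 2 * x exactly, so the minimizer is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]


def grad_i(i, w):
    # Gradient of the i-th term 0.5 * (xs[i] * w - ys[i]) ** 2.
    return xs[i] * (xs[i] * w - ys[i])


w_hat = svrg(grad_i, n=len(xs), w0=0.0, step=0.02, epochs=20, inner_steps=20)
```

Each epoch pays for one full gradient pass, and the inner steps then reuse it as a correction term. Here the snapshot is simply the last inner iterate, which is one common choice.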

artificial intelligence, deep learning, machine learning, (15 more...)

1603.0616

Country: North America > United States (0.46)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Gulcehre, Caglar, Moczulski, Marcin, Denil, Misha, Bengio, Yoshua

Noisy Activation Functions

arXiv.org Machine Learning Apr-3-2016

Common nonlinear activation functions used in neural networks can cause training difficulties due to the saturation behavior of the activation function, which may hide dependencies that are not visible to vanilla-SGD (using first order gradients only). Gating mechanisms that use softly saturating activation functions to emulate the discrete switching of digital logic circuits are good examples of this. We propose to exploit the injection of appropriate noise so that the gradients may flow easily, even if the noiseless application of the activation function would yield zero gradient. Large noise will dominate the noise-free gradient and allow stochastic gradient descent to explore more. By adding noise only to the problematic parts of the activation function, we allow the optimization procedure to explore the boundary between the degenerate (saturating) and the well-behaved parts of the activation function. We also establish connections to simulated annealing, when the amount of noise is annealed down, making it easier to optimize hard objective functions. We find experimentally that replacing such saturating activation functions by noisy variants helps training in many contexts, yielding state-of-the-art or competitive results on different datasets and tasks, especially when training seems to be the most difficult, e.g., when curriculum learning is necessary to obtain good results.

artificial intelligence, deep learning, machine learning, (14 more...)

1603.00391

Country: North America > Canada (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Semiconductors & Electronics (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

#artificialintelligence Apr-2-2016, 00:56:46 GMT

An overview of gradient descent optimization algorithms

Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent. These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This blog post aims at providing you with intuitions towards the behaviour of different algorithms for optimizing gradient descent that will help you put them to use. We are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training. Subsequently, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules.

artificial intelligence, gradient, machine learning, (18 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

#artificialintelligence Mar-30-2016, 01:10:03 GMT

Linear Regression Tutorial Using Gradient Descent for Machine Learning - Machine Learning Mastery

Stochastic Gradient Descent is an important and widely used algorithm in machine learning. In this post you will discover how to use Stochastic Gradient Descent to learn the coefficients for a simple linear regression model by minimizing the error on a training dataset. Here is the raw data. The attribute x is the input variable and y is the output variable that we are trying to predict.
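A minimal sketch of the procedure described, assuming a model of the form y = b0 + b1 * x and per-sample updates on the squared error. The data, learning rate, epoch count, and the name `sgd_linear_regression` are hypothetical and not taken from the tutorial. The samples are visited in a fixed order so the run is reproducible; shuffling them each epoch would be closer to stochastic gradient descent proper.

```python
def sgd_linear_regression(xs, ys, learning_rate, epochs):
    """Fit y = b0 + b1 * x with one gradient step per training sample."""
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = (b0 + b1 * x) - y          # prediction minus target
            b0 -= learning_rate * error        # d(0.5 * error**2) / d b0
            b1 -= learning_rate * error * x    # d(0.5 * error**2) / d b1
    return b0, b1


# Hypothetical noise-free data generated from y = 1 + 2 * x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]

b0, b1 = sgd_linear_regression(xs, ys, learning_rate=0.05, epochs=1000)
```

Because the toy data lie exactly on a line, the coefficients approach the generating values b0 = 1 and b1 = 2. With noisy data a fixed learning rate would leave them fluctuating around the least-squares fit instead.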

artificial intelligence, coefficient, machine learning, (14 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

#artificialintelligence Mar-29-2016, 06:15:57 GMT

Saddles again

Thanks to Rong for the very nice blog post describing critical points of nonconvex functions and how to avoid them. I'd like to follow up on his post to highlight a fact that is not widely appreciated in nonlinear optimization. Though we often teach the contrary in our intro courses, it is in fact super hard to converge to a saddle point. If you move ever so slightly you fall off the saddle. Even simple algorithms like gradient descent with constant step sizes can't converge to saddle points unless you try really hard.
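A small sketch of this "falling off the saddle" effect, on a hypothetical function that is not from the post: f(x, y) = 0.5 x^2 + 0.25 y^4 - 0.5 y^2, which has a saddle at (0, 0) and minima at (0, 1) and (0, -1). The function, step size, and starting points are assumptions chosen for illustration.

```python
def gradient(x, y):
    # Gradient of the assumed toy function
    # f(x, y) = 0.5 * x**2 + 0.25 * y**4 - 0.5 * y**2.
    return x, y**3 - y


def gradient_descent(start, step=0.1, iterations=500):
    """Plain gradient descent with a constant step size."""
    x, y = start
    for _ in range(iterations):
        gx, gy = gradient(x, y)
        x -= step * gx
        y -= step * gy
    return x, y


# Started exactly on the line y = 0, the iterates slide into the saddle.
on_saddle = gradient_descent((1.0, 0.0))

# A perturbation of 1e-8 in y grows each step and the run ends at a minimum.
perturbed = gradient_descent((1.0, 1e-8))
```

Only the measure-zero set of starting points with y exactly 0 reaches the saddle in this toy; any other start drifts to one of the two minima.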

artificial intelligence, converge, machine learning, (13 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.45)

#artificialintelligence Mar-28-2016, 07:40:40 GMT

A Neural Network in 13 lines of Python (Part 2 - Gradient Descent) - i am trask

Summary: I learn best with toy code that I can play with. This tutorial teaches gradient descent via a very simple toy example, a short Python implementation. Followup Post: I intend to write a followup post to this one adding popular features leveraged by state-of-the-art approaches (likely Dropout, DropConnect, and Momentum). Feel free to follow if you'd be interested in reading more and thanks for all the feedback! In Part 1, I laid out the basis for backpropagation in a simple neural network. Backpropagation allowed us to measure how each weight in the network contributed to the overall error. This ultimately allowed us to change these weights using a different algorithm, Gradient Descent.

artificial intelligence, gradient descent, machine learning, (17 more...)

Genre: Instructional Material > Course Syllabus & Notes (0.75)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)

Mania, Horia, Pan, Xinghao, Papailiopoulos, Dimitris, Recht, Benjamin, Ramchandran, Kannan, Jordan, Michael I.

Perturbed Iterate Analysis for Asynchronous Stochastic Optimization

arXiv.org Machine Learning Mar-25-2016

We introduce and analyze stochastic optimization methods where the input to each gradient update is perturbed by bounded noise. We show that this framework forms the basis of a unified approach to analyze asynchronous implementations of stochastic optimization algorithms. In this framework, asynchronous stochastic optimization algorithms can be thought of as serial methods operating on noisy inputs. Using our perturbed iterate framework, we provide new analyses of the Hogwild! algorithm and asynchronous stochastic coordinate descent, that are simpler than earlier analyses, remove many assumptions of previous models, and in some cases yield improved upper bounds on the convergence rates. We proceed to apply our framework to develop and analyze KroMagnon: a novel, parallel, sparse stochastic variance-reduced gradient (SVRG) algorithm. We demonstrate experimentally on a 16-core machine that the sparse and parallel version of SVRG is in some cases more than four orders of magnitude faster than the standard SVRG algorithm.

artificial intelligence, hogwild, machine learning, (16 more...)

1507.0697

Genre: Research Report (0.84)

Industry: Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

@machinelearnbot Mar-24-2016, 13:05:31 GMT

Escaping from Saddle Points

Non-convex functions can be much more complicated. In this post we will discuss various types of critical points that you might encounter when you go off the convex path. In particular, we will see in many cases simple heuristics based on gradient descent can lead you to a local minimum in polynomial time. Here \eta is a small step size. This is the gradient descent algorithm.
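For reference, the standard gradient descent step with step size \eta can be written as follows. The symbols x_t (the current iterate) and f (the objective) are assumed notation, since only \eta appears in the excerpt.

```latex
x_{t+1} = x_t - \eta \, \nabla f(x_t)
```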

artificial intelligence, machine learning, saddle point, (17 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.58)

Zheng, Qinqing, Lafferty, John

A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements

arXiv.org Machine Learning Mar-24-2016

Semidefinite programming has become a key optimization tool in many areas of applied mathematics, signal processing and machine learning. SDPs often arise naturally from the problem structure, or are derived as surrogate optimizations that are relaxations of difficult combinatorial problems [7, 1, 8]. In spite of the importance of SDPs in principle--promising efficient algorithms with polynomial runtime guarantees--it is widely recognized that current optimization algorithms based on interior point methods can handle only relatively small problems. Thus, a considerable gap exists between the theory and applicability of SDP formulations. Scalable algorithms for semidefinite programming, and closely related families of nonconvex programs more generally, are greatly needed. A parallel development is the surprising effectiveness of simple classical procedures such as gradient descent for large scale problems, as explored in the recent machine learning literature. In many areas of machine learning and signal processing such as classification, deep learning, and phase retrieval, gradient descent methods, in particular first order stochastic optimization, have led to remarkably efficient algorithms that can attack very large scale problems [3, 2, 10, 6]. In this paper we build on this work to develop first-order algorithms for solving the rank minimization problem under random measurements and a closely related family of semidefinite programs. Our algorithms are efficient and scalable, and we prove that they attain linear convergence to the global optimum under natural assumptions.

algorithm, artificial intelligence, machine learning, (16 more...)

1506.06081

Genre: Research Report (0.82)

Industry: Education > Curriculum (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)