Gradient Descent
Does it make any sense to apply convolution to inputs which have no order/distance between them? • /r/MachineLearning
I know that CNN's are used a lot for computer vision where we want to deal with local features of the image. This makes sense because one pixel influences how we interpret its neighbours. If we had data for, say, medical decisions and we recorded many variables like age, weight, and existing medical conditions, these inputs have no distance between them and no sense of order which we could use to identify nearest neighbours. Having said that I could imagine that it might be useful to use a CNN for the inputs because it groups together inputs in ways which are unlikely to occur by chance if we just trained a NN by stochastic gradient descent.
Stochastic Variance Reduction for Nonconvex Optimization
Reddi, Sashank J., Hefny, Ahmed, Sra, Suvrit, Poczos, Barnabas, Smola, Alex
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Noisy Activation Functions
Gulcehre, Caglar, Moczulski, Marcin, Denil, Misha, Bengio, Yoshua
Common nonlinear activation functions used in neural networks can cause training difficulties due to the saturation behavior of the activation function, which may hide dependencies that are not visible to vanilla-SGD (using first order gradients only). Gating mechanisms that use softly saturating activation functions to emulate the discrete switching of digital logic circuits are good examples of this. We propose to exploit the injection of appropriate noise so that the gradients may flow easily, even if the noiseless application of the activation function would yield zero gradient. Large noise will dominate the noise-free gradient and allow stochastic gradient descent to explore more. By adding noise only to the problematic parts of the activation function, we allow the optimization procedure to explore the boundary between the degenerate (saturating) and the well-behaved parts of the activation function. We also establish connections to simulated annealing, when the amount of noise is annealed down, making it easier to optimize hard objective functions. We find experimentally that replacing such saturating activation functions by noisy variants helps training in many contexts, yielding state-of-the-art or competitive results on different datasets and task, especially when training seems to be the most difficult, e.g., when curriculum learning is necessary to obtain good results.
An overview of gradient descent optimization algorithms
Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This blog post aims at providing you with intuitions towards the behaviour of different algorithms for optimizing gradient descent that will help you put them to use. We are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training. Subsequently, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules.
Linear Regression Tutorial Using Gradient Descent for Machine Learning - Machine Learning Mastery
Stochastic Gradient Descent is an important and widely used algorithm in machine learning. In this post you will discover how to use Stochastic Gradient Descent to learn the coefficients for a simple linear regression model by minimizing the error on a training dataset. Linear Regression Tutorial Using Gradient Descent for Machine Learning Photo by Stig Nygaard, some rights reserved. Here is the raw data. The attribute x is the input variable and y is the output variable that we are trying to predict.
Saddles again
Thanks to Rong for the very nice blog post describing critical points of nonconvex functions and how to avoid them. I'd like to follow up on his post to highlight a fact that is not widely appreciated in nonlinear optimization. Though we often teach the contrary in our intro courses, it is in fact super hard to converge to a saddle point. If you move ever so slightly you fall off the saddle). Even simple algorithms like gradient descent with constant step sizes can't converge to saddle points unless you try really hard.
A Neural Network in 13 lines of Python (Part 2 - Gradient Descent) - i am trask
Summary: I learn best with toy code that I can play with. This tutorial teaches gradient descent via a very simple toy example, a short python implementation. Followup Post: I intend to write a followup post to this one adding popular features leveraged by state-of-the-art approaches (likely Dropout, DropConnect, and Momentum). Feel free to follow if you'd be interested in reading more and thanks for all the feedback! In Part 1, I laid out the basis for backpropagation in a simple neural network. Backpropagation allowed us to measure how each weight in the network contributed to the overall error. This ultimately allowed us to change these weights using a different algorithm, Gradient Descent.
Perturbed Iterate Analysis for Asynchronous Stochastic Optimization
Mania, Horia, Pan, Xinghao, Papailiopoulos, Dimitris, Recht, Benjamin, Ramchandran, Kannan, Jordan, Michael I.
We introduce and analyze stochastic optimization methods where the input to each gradient update is perturbed by bounded noise. We show that this framework forms the basis of a unified approach to analyze asynchronous implementations of stochastic optimization algorithms.In this framework, asynchronous stochastic optimization algorithms can be thought of as serial methods operating on noisy inputs. Using our perturbed iterate framework, we provide new analyses of the Hogwild! algorithm and asynchronous stochastic coordinate descent, that are simpler than earlier analyses, remove many assumptions of previous models, and in some cases yield improved upper bounds on the convergence rates. We proceed to apply our framework to develop and analyze KroMagnon: a novel, parallel, sparse stochastic variance-reduced gradient (SVRG) algorithm. We demonstrate experimentally on a 16-core machine that the sparse and parallel version of SVRG is in some cases more than four orders of magnitude faster than the standard SVRG algorithm.
Escaping from Saddle Points
Non-convex functions can be much more complicated. In this post we will discuss various types of critical points that you might encounter when you go off the convex path. In particular, we will see in many cases simple heuristics based on gradient descent can lead you to a local minimum in polynomial time. Here \eta is a small step size. This is the gradient descent algorithm.
A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements
Zheng, Qinqing, Lafferty, John
Semidefinite programming has become a key optimization tool in many areas of applied mathematics, signal processing and machine learning. SDPs often arise naturally from the problem structure, or are derived as surrogate optimizations that are relaxations of difficult combinatorial problems [7, 1, 8]. In spite of the importance of SDPs in principle--promising efficient algorithms with polynomial runtime guarantees--it is widely recognized that current optimization algorithms based on interior point methods can handle only relatively small problems. Thus, a considerable gap exists between the theory and applicability of SDP formulations. Scalable algorithms for semidefinite programming, and closely related families of nonconvex programs more generally, are greatly needed. A parallel development is the surprising effectiveness of simple classical procedures such as gradient descent for large scale problems, as explored in the recent machine learning literature. In many areas of machine learning and signal processing such as classification, deep learning, and phase retrieval, gradient descent methods, in particular first order stochastic optimization, have led to remarkably efficient algorithms that can attack very large scale problems [3, 2, 10, 6]. In this paper we build on this work to develop first-order algorithms for solving the rank minimization problem under random measurements and a closely related family of semidefinite programs. Our algorithms are efficient and scalable, and we prove that they attain linear convergence to the global optimum under natural assumptions.