AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

The News on Auto-tuning

#artificialintelligenceJun-21-2016, 20:45:02 GMT

Ed. Note: this post is in my voice, but it was co-written with Kevin Jamieson. Kevin provided the awesome plots too. It's all the rage in machine learning these days to build complex, deep pipelines with thousands of tunable parameters. Now, I don't mean parameters that we learn by stochastic gradient descent. But I mean architectural concerns, like the value of the regularization parameter, the size of a convolutional window, or the breadth of a spatio-temporal tower of attention.

artificial intelligence, machine learning, random search, (15 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback

[1606.04474] Learning to learn by gradient descent by gradient descent • /r/MachineLearning

@machinelearnbotJun-15-2016, 03:10:13 GMT

One thing, which I'm not sure, is how correct is their comparison. By that I mean that they fix the global learning rate for the "hand designed" algos and choose it by grid search. However, we do know well that in most problems we can start with a larger learning rate an decay it over time after it platoes. The issue of not conisdering that probably the best global learning rate for the whole run, would be one which is very slow, but eventually outperforms faster ones. Nevertheless, this is an interesting work, although I'm still quite skeptical of such optimiziers to generalize well on large models.

artificial intelligence, gradient descent, machine learning, (2 more...)

@machinelearnbot

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.76)

Add feedback

A Class of Parallel Doubly Stochastic Algorithms for Large-Scale Learning

Mokhtari, Aryan, Koppel, Alec, Ribeiro, Alejandro

arXiv.org Machine LearningJun-15-2016

We consider learning problems over training sets in which both, the number of training examples and the dimension of the feature vectors, are large. To solve these problems we propose the random parallel stochastic algorithm (RAPSA). We call the algorithm random parallel because it utilizes multiple parallel processors to operate on a randomly chosen subset of blocks of the feature vector. We call the algorithm stochastic because processors choose training subsets uniformly at random. Algorithms that are parallel in either of these dimensions exist, but RAPSA is the first attempt at a methodology that is parallel in both the selection of blocks and the selection of elements of the training set. In RAPSA, processors utilize the randomly chosen functions to compute the stochastic gradient component associated with a randomly chosen block. The technical contribution of this paper is to show that this minimally coordinated algorithm converges to the optimal classifier when the training objective is convex. Moreover, we present an accelerated version of RAPSA (ARAPSA) that incorporates the objective function curvature information by premultiplying the descent direction by a Hessian approximation matrix. We further extend the results for asynchronous settings and show that if the processors perform their updates without any coordination the algorithms are still convergent to the optimal argument. RAPSA and its extensions are then numerically evaluated on a linear estimation problem and a binary image classification task using the MNIST handwritten digit dataset.

artificial intelligence, machine learning, rapsa, (14 more...)

arXiv.org Machine Learning

1606.04991

Country: North America > United States > Pennsylvania (0.28)

Genre: Research Report (1.00)

Industry: Education (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

Add feedback

Kalman-based Stochastic Gradient Method with Stop Condition and Insensitivity to Conditioning

Patel, Vivak

arXiv.org Machine LearningJun-10-2016

Modern proximal and stochastic gradient descent (SGD) methods are believed to efficiently minimize large composite objective functions, but such methods have two algorithmic challenges: (1) a lack of fast or justified stop conditions, and (2) sensitivity to the objective function's conditioning. In response to the first challenge, modern proximal and SGD methods guarantee convergence only after multiple epochs, but such a guarantee renders proximal and SGD methods infeasible when the number of component functions is very large or infinite. In response to the second challenge, second order SGD methods have been developed, but they are marred by the complexity of their analysis. In this work, we address these challenges on the limited, but important, linear regression problem by introducing and analyzing a second order proximal/SGD method based on Kalman Filtering (kSGD). Through our analysis, we show kSGD is asymptotically optimal, develop a fast algorithm for very large, infinite or streaming data sources with a justified stop condition, prove that kSGD is insensitive to the problem's conditioning, and develop a unique approach for analyzing the complex second order dynamics. Our theoretical results are supported by numerical experiments on three regression problems (linear, nonparametric wavelet, and logistic) using three large publicly available datasets. Moreover, our analysis and experiments lay a foundation for embedding kSGD in multiple epoch algorithms, extending kSGD to other problem classes, and developing parallel and low memory kSGD implementations.

artificial intelligence, machine learning, regression problem, (16 more...)

arXiv.org Machine Learning

doi: 10.1137/15M1048239

1512.01139

Country: North America > United States (0.67)

Genre: Research Report (0.83)

Industry:

Health & Medicine > Health Care Providers & Services > Reimbursement (0.46)
Health & Medicine > Government Relations & Public Policy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)

Add feedback

Towards stability and optimality in stochastic gradient descent

Toulis, Panos, Tran, Dustin, Airoldi, Edoardo M.

arXiv.org Machine LearningJun-7-2016

Iterative procedures for parameter estimation based on stochastic gradient descent allow the estimation to scale to massive data sets. However, in both theory and practice, they suffer from numerical instability. Moreover, they are statistically inefficient as estimators of the true parameter value. To address these two issues, we propose a new iterative procedure termed averaged implicit SGD (AI-SGD). For statistical efficiency, AI-SGD employs averaging of the iterates, which achieves the optimal Cram\'{e}r-Rao bound under strong convexity, i.e., it is an optimal unbiased estimator of the true parameter value. For numerical stability, AI-SGD employs an implicit update at each iteration, which is related to proximal operators in optimization. In practice, AI-SGD achieves competitive performance with other state-of-the-art procedures. Furthermore, it is more stable than averaging procedures that do not employ proximal updates, and is simple to implement as it requires fewer tunable hyperparameters than procedures that do employ proximal updates.

artificial intelligence, machine learning, procedure, (15 more...)

arXiv.org Machine Learning

1505.02417

Genre: Research Report > New Finding (0.46)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Visualizing the gradient descent method

#artificialintelligenceJun-5-2016, 20:05:42 GMT

In the gradient descent method of optimization, a hypothesis function, h_\boldsymbol{\theta}(x), is fitted to a data set, (x {(i)}, y {(i)}) ( i 1,2,\cdots,m) by minimizing an associated cost function, J(\boldsymbol{\theta}) in terms of the parameters \boldsymbol\theta \theta_0, \theta_1, \cdots . The cost function describes how closely the hypothesis fits the data for a given choice of \boldsymbol \theta . For example, one might wish to fit a given data set to a straight line, h_\boldsymbol{\theta}(x) \theta_0 \theta_1 x. To simplify things, consider fitting a data set to a straight line through the origin: h_\theta(x) \theta_1 x . In this one-dimensional problem, we can plot a simple graph for J(\theta_1) and follow the iterative procedure which trys to converge on its minimum.

artificial intelligence, boldsymbol, machine learning, (9 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Add feedback

An overview of gradient descent optimization algorithms

#artificialintelligenceJun-4-2016, 11:15:33 GMT

Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This blog post aims at providing you with intuitions towards the behaviour of different algorithms for optimizing gradient descent that will help you put them to use. We are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training. Subsequently, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules. We will also take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting.

artificial intelligence, gradient, machine learning, (18 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

[1606.00511] Large Scale Distributed Hessian-Free Optimization for Deep Neural Network • /r/MachineLearning

@machinelearnbotJun-3-2016, 07:40:59 GMT

Training deep neural network is a high dimensional and a highly non-convex optimization problem. Stochastic gradient descent (SGD) algorithm and it's variations are the current state-of-the-art solvers for this task. However, due to non-covexity nature of the problem, it was observed that SGD slows down near saddle point. Recent empirical work claim that by detecting and escaping saddle point efficiently, it's more likely to improve training performance. With this objective, we revisit Hessian-free optimization method for deep networks.

deep learning, hessian-free optimization, machine learning, (3 more...)

@machinelearnbot

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Add feedback

Black-box $\alpha$-divergence Minimization

Hernández-Lobato, José Miguel, Li, Yingzhen, Rowland, Mark, Hernández-Lobato, Daniel, Bui, Thang, Turner, Richard E.

arXiv.org Machine LearningJun-1-2016

Black-box alpha (BB-$\alpha$) is a new approximate inference method based on the minimization of $\alpha$-divergences. BB-$\alpha$ scales to large datasets because it can be implemented using stochastic gradient descent. BB-$\alpha$ can be applied to complex probabilistic models with little effort since it only requires as input the likelihood function and its gradients. These gradients can be easily obtained using automatic differentiation. By changing the divergence parameter $\alpha$, the method is able to interpolate between variational Bayes (VB) ($\alpha \rightarrow 0$) and an algorithm similar to expectation propagation (EP) ($\alpha = 1$). Experiments on probit regression and neural network regression and classification problems show that BB-$\alpha$ with non-standard settings of $\alpha$, such as $\alpha = 0.5$, usually produces better predictions than with $\alpha \rightarrow 0$ (VB) or $\alpha = 1$ (EP).

artificial intelligence, machine learning, posterior, (19 more...)

arXiv.org Machine Learning

1511.03243

Country:

North America > United States (0.46)
Europe (0.28)

Genre: Research Report (1.00)

Industry:

Energy (0.68)
Transportation > Air (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

The local convexity of solving systems of quadratic equations

White, Chris D., Sanghavi, Sujay, Ward, Rachel

arXiv.org Machine LearningJun-1-2016

This paper considers the recovery of a rank $r$ positive semidefinite matrix $X X^T\in\mathbb{R}^{n\times n}$ from $m$ scalar measurements of the form $y_i := a_i^T X X^T a_i$ (i.e., quadratic measurements of $X$). Such problems arise in a variety of applications, including covariance sketching of high-dimensional data streams, quadratic regression, quantum state tomography, among others. A natural approach to this problem is to minimize the loss function $f(U) = \sum_i (y_i - a_i^TUU^Ta_i)^2$ which has an entire manifold of solutions given by $\{XO\}_{O\in\mathcal{O}_r}$ where $\mathcal{O}_r$ is the orthogonal group of $r\times r$ orthogonal matrices; this is {\it non-convex} in the $n\times r$ matrix $U$, but methods like gradient descent are simple and easy to implement (as compared to semidefinite relaxation approaches). In this paper we show that once we have $m \geq C nr \log^2(n)$ samples from isotropic gaussian $a_i$, with high probability {\em (a)} this function admits a dimension-independent region of {\em local strong convexity} on lines perpendicular to the solution manifold, and {\em (b)} with an additional polynomial factor of $r$ samples, a simple spectral initialization will land within the region of convexity with high probability. Together, this implies that gradient descent with initialization (but no re-sampling) will converge linearly to the correct $X$, up to an orthogonal transformation. We believe that this general technique (local convexity reachable by spectral initialization) should prove applicable to a broader class of nonconvex optimization problems.

artificial intelligence, machine learning, optimization problem, (17 more...)

arXiv.org Machine Learning

1506.07868

Country: North America > United States > Texas (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback