AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Optimal Web-Scale Tiering as a Flow Problem

Leung, Gilbert, Quadrianto, Novi, Tsioutsiouliklis, Kostas, Smola, Alex J.

Neural Information Processing SystemsFeb-15-2020, 02:12:00 GMT

We present a fast online solver for large scale maximum-flow problems as they occur in portfolio optimization, inventory management, computer vision, and logistics. Our algorithm solves an integer linear program in an online fashion. It exploits total unimodularity of the constraint matrix and a Lagrangian relaxation to solve the problem as a convex online game. The algorithm generates approximate solutions of max-flow problems by performing stochastic gradient descent on a set of flows. We apply the algorithm to optimize tier arrangement of over 80 Million web pages on a layered set of caches to serve an incoming query stream optimally. We provide an empirical demonstration of the effectiveness of our method on real query-pages data.

flow problem, optimal web-scale tiering

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Add feedback

Periodic Step Size Adaptation for Single Pass On-line Learning

Hsu, Chun-nan, Chang, Yu-ming, Huang, Hanshen, Lee, Yuh-jye

Neural Information Processing SystemsFeb-15-2020, 01:57:37 GMT

It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically and achieve near-optimal results in experiments on a wide variety of models and tasks. Papers published at the Neural Information Processing Systems Conference.

matrix, single pass on-line learning, step size adaptation

Neural Information Processing Systems

Genre: Instructional Material > Online (0.40)

Industry: Education > Educational Setting > Online (0.76)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.70)

Add feedback

A Stochastic Gradient Method with an Exponential Convergence _Rate for Finite Training Sets

Roux, Nicolas L., Schmidt, Mark, Bach, Francis R.

Neural Information Processing SystemsFeb-15-2020, 00:11:41 GMT

We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. Papers published at the Neural Information Processing Systems Conference.

algorithm, finite training set, stochastic gradient method

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.99)

Add feedback

On Tracking The Partition Function

Desjardins, Guillaume, Bengio, Yoshua, Courville, Aaron C.

Neural Information Processing SystemsFeb-14-2020, 23:58:49 GMT

Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function $Z$. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to {\bf track} the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change $\Delta Z$ incurred by each gradient update, (2) estimating the difference in $Z$ over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering.

log partition function, partition function, tracking, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.63)

Add feedback

Learning a Distance Metric from a Network

Shaw, Blake, Huang, Bert, Jebara, Tony

Neural Information Processing SystemsFeb-14-2020, 23:26:07 GMT

Many real-world networks are described by both connectivity information and features for every node. To better model and understand these networks, we present structure preserving metric learning (SPML), an algorithm for learning a Mahalanobis distance metric from a network such that the learned distances are tied to the inherent connectivity structure of the network. Like the graph embedding algorithm structure preserving embedding, SPML learns a metric which is structure preserving, meaning a connectivity algorithm such as k-nearest neighbors will yield the correct connectivity when applied using the distances from the learned metric. We show a variety of synthetic and real-world experiments where SPML predicts link patterns from node features more accurately than standard techniques. We further demonstrate a method for optimizing SPML based on stochastic gradient descent which removes the running-time dependency on the size of the network and allows the method to easily scale to networks of thousands of nodes and millions of edges.

distance metric, spml

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Add feedback

Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities

Yao, Angela, Gall, Juergen, Gool, Luc V., Urtasun, Raquel

Neural Information Processing SystemsFeb-14-2020, 22:56:44 GMT

A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data. Existing approaches however, are either too simplistic (linear), too complex to learn, or can only learn latent spaces from "simple data", i.e., single activities such as walking or running. In this paper, we present an efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities. Furthermore, we derive an incremental algorithm for the online setting which can update the latent space without extensive relearning. We demonstrate the effectiveness of our approach on the task of monocular and multi-view tracking and show that our approach outperforms the state-of-the-art.

latent space, probabilistic non-linear latent variable model, tracking complex activity, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Add feedback

Large Scale Distributed Deep Networks

Dean, Jeffrey, Corrado, Greg, Monga, Rajat, Chen, Kai, Devin, Matthieu, Mao, Mark, Ranzato, Marc', aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, Le, Quoc V., Ng, Andrew Y.

Neural Information Processing SystemsFeb-14-2020, 22:41:59 GMT

Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports for a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. We have successfully used our system to train a deep network 100x larger than previously reported in the literature, and achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories.

algorithm, deep network, downpour sgd, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.62)

Add feedback

Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Recht, Benjamin, Re, Christopher, Wright, Stephen, Niu, Feng

Neural Information Processing SystemsFeb-14-2020, 22:12:03 GMT

Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented *without any locking*. We present an update scheme called Hogwild which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild achieves a nearly optimal rate of convergence.

lock-free approach, parallelizing stochastic gradient descent, stochastic gradient descent, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

Moulines, Eric, Bach, Francis R.

Neural Information Processing SystemsFeb-14-2020, 21:56:22 GMT

We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant.

machine learning, non-asymptotic analysis, stochastic approximation algorithm, (1 more...)

Neural Information Processing Systems

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.64)

Add feedback

Stochastic Gradient Descent with Only One Projection

Mahdavi, Mehrdad, Yang, Tianbao, Jin, Rong, Zhu, Shenghuo, Yi, Jinfeng

Neural Information Processing SystemsFeb-14-2020, 21:56:04 GMT

Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at {\it each} iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing a novel stochastic gradient descent algorithm that does not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an $O(1/\sqrt{T})$ convergence rate for general convex optimization, and an $O(\ln T/T)$ rate for strongly convex optimization under mild conditions about the domain and the objective function.

convex optimization, projection, stochastic gradient descent, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback