AITopics | Gradient Descent

Gradient Descent

Communication-Efficient Algorithms for Statistical Optimization

Zhang, Yuchen, Duchi, John C., Wainwright, Martin

arXiv.org Machine Learning | Oct-11-2013

We analyze two communication-efficient algorithms for distributed statistical optimization on large-scale data sets. The first algorithm is a standard averaging method that distributes the $N$ data samples evenly to $m$ machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as $\mathcal{O}(N^{-1}+(N/m)^{-2})$. Whenever $m \le \sqrt{N}$, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all $N$ samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as $\mathcal{O}(N^{-1} + (N/m)^{-3})$, and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as $\mathcal{O}(N^{-1} + (N/m)^{-3/2})$, easing computation at the expense of penalties in the rate of convergence. We also provide experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with $N \approx 2.4 \times 10^8$ samples and $d \approx 740,000$ covariates.
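The averaging method described in this abstract (split the samples evenly, minimize on each subset, average the estimates) can be illustrated with a toy problem. The sketch below is not the authors' code: it assumes a one-dimensional least-squares slope as the local minimization, synthetic Gaussian data, and hypothetical helper names (`local_minimizer`, `average_mixture`), and it simulates the machines with list slices rather than real workers.

```python
import random


def local_minimizer(samples):
    """Least-squares slope through the origin: argmin over theta of sum (y - theta*x)^2."""
    sxy = sum(x * y for x, y in samples)
    sxx = sum(x * x for x, y in samples)
    return sxy / sxx


def average_mixture(samples, m):
    """Split the samples evenly across m simulated machines, solve locally, then average."""
    n = len(samples) // m  # samples per machine; any remainder is dropped
    estimates = [local_minimizer(samples[i * n:(i + 1) * n]) for i in range(m)]
    return sum(estimates) / m


# Synthetic data (an assumption for illustration): y = 2x + Gaussian noise.
rng = random.Random(0)
true_theta = 2.0
data = []
for _ in range(10_000):
    x = rng.gauss(0.0, 1.0)
    data.append((x, true_theta * x + rng.gauss(0.0, 1.0)))

centralized = local_minimizer(data)       # one machine sees all N samples
averaged = average_mixture(data, m=10)    # m = 10 is well below sqrt(N) = 100
print(centralized, averaged)
```

With `m` at or below the square root of the sample count, both estimates should land close to the true slope, which is the regime the abstract singles out.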

artificial intelligence, inequality, machine learning, (17 more...)

arXiv.org Machine Learning

1209.4129

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming

Ghadimi, Saeed, Lan, Guanghui

arXiv.org Machine Learning | Sep-21-2013

In this paper, we introduce a new stochastic approximation (SA) type algorithm, namely the randomized stochastic gradient (RSG) method, for solving an important class of nonlinear (possibly nonconvex) stochastic programming (SP) problems. We establish the complexity of this method for computing an approximate stationary point of a nonlinear programming problem. We also show that this method possesses a nearly optimal rate of convergence if the problem is convex. We discuss a variant of the algorithm which consists of applying a post-optimization phase to evaluate a short list of solutions generated by several independent runs of the RSG method, and show that this modification significantly improves the large-deviation properties of the algorithm. These methods are then specialized for solving a class of simulation-based optimization problems in which only stochastic zeroth-order information is available.
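As a rough illustration of the two ideas in this abstract, the sketch below runs stochastic gradient steps and returns the iterate at a randomly drawn stopping index, then applies a post-optimization pass over several independent runs. Everything specific here is an assumption made for the example rather than taken from the paper: the constant step size, the uniform draw of the stopping index, the selection rule (smallest averaged stochastic gradient), the test function $f(x) = x^4/4 - x^2/2$, and the names `noisy_grad`, `rsg`, and `post_optimize`.

```python
import random


def noisy_grad(x, rng, noise=0.1):
    """Stochastic gradient of the nonconvex f(x) = x**4/4 - x**2/2 (illustrative choice)."""
    return x ** 3 - x + rng.gauss(0.0, noise)


def rsg(grad, x0, step, max_iters, rng):
    """Run SGD and return the iterate after a uniformly random number of steps."""
    r = rng.randint(1, max_iters)  # random stopping index, drawn before the run
    x = x0
    for _ in range(r):
        x -= step * grad(x, rng)
    return x, r


def post_optimize(grad, candidates, rng, n_eval=200):
    """Return the candidate whose averaged stochastic gradient is smallest in magnitude."""
    def estimated_grad_size(x):
        return abs(sum(grad(x, rng) for _ in range(n_eval)) / n_eval)
    return min(candidates, key=estimated_grad_size)


rng = random.Random(0)
# Several independent runs produce a short list of candidate solutions.
runs = [rsg(noisy_grad, x0=2.0, step=0.05, max_iters=500, rng=rng)[0] for _ in range(5)]
best = post_optimize(noisy_grad, runs, rng)
print(runs, best)
```

The post-optimization pass guards against an unlucky run that stopped very early, which is one intuition for why evaluating a short list can improve tail behavior.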

artificial intelligence, machine learning, optimization problem, (17 more...)

arXiv.org Machine Learning

1309.5549

Country:

North America > United States > Florida > Alachua County > Gainesville (0.14)
North America > United States > New York (0.04)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

SCMF: Sparse Covariance Matrix Factorization for Collaborative Filtering

Shi, Jianping (Chinese University of Hong Kong) | Wang, Naiyan (Hong Kong University of Science and Technology) | Xia, Yang (Chinese University of Hong Kong) | Yeung, Dit-Yan (Hong Kong University of Science and Technology) | King, Irwin (Chinese University of Hong Kong) | Jia, Jiaya (Chinese University of Hong Kong)

AAAI Conferences | Aug-3-2013

Matrix factorization (MF) is a popular collaborative filtering approach for recommender systems due to its simplicity and effectiveness. Existing MF methods either assume that all latent features are uncorrelated or assume that all are correlated. To address the important issue of what structure should be imposed on the features, we investigate the covariance matrix of the latent features learned from real data. Based on the findings, we propose an MF model with a sparse covariance prior which favors a sparse yet non-diagonal covariance matrix. Not only can this reflect the semantics more faithfully, but imposing sparsity can also have a side effect of preventing overfitting. Starting from a probabilistic generative model with a sparse covariance prior, we formulate the model inference problem as a maximum a posteriori (MAP) estimation problem. The optimization procedure makes use of stochastic gradient descent and majorization-minimization. For empirical validation, we conduct experiments using the MovieLens and Netflix datasets to compare the proposed method with two strong baselines which use different priors. Experimental results show that our sparse covariance prior can lead to performance improvement.

collaborative filtering, scmf, sparse covariance matrix factorization

AAAI Conferences

Twenty-Third International Joint Conference on Artificial Intelligence

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.53)

Basis Adaptation for Sparse Nonlinear Reinforcement Learning

Mahadevan, Sridhar (University of Massachusetts, Amherst) | Giguere, Stephen (University of Massachusetts, Amherst) | Jacek, Nicholas (University of Massachusetts, Amherst)

AAAI Conferences | Jul-9-2013

This paper presents a new approach to representation discovery in reinforcement learning (RL) using basis adaptation. We introduce a general framework for basis adaptation as nonlinear separable least-squares value function approximation based on finding Fréchet gradients of an error function using variable projection functionals. We then present a scalable proximal gradient-based approach for basis adaptation using the recently proposed mirror-descent framework for RL. Unlike traditional temporal-difference (TD) methods for RL, mirror descent based RL methods undertake proximal gradient updates of weights in a dual space, which is linked together with the primal space using a Legendre transform involving the gradient of a strongly convex function. Mirror descent RL can be viewed as a proximal TD algorithm using Bregman divergence as the distance generating function. We present a new class of regularized proximal-gradient based TD methods, which combine feature selection through sparse L1 regularization and basis adaptation. Experimental results are provided to illustrate and validate the approach.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

AAAI Conferences

Twenty-Seventh AAAI Conference on Artificial Intelligence

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Early stopping and non-parametric regression: An optimal data-dependent stopping rule

Raskutti, Garvesh, Wainwright, Martin J., Yu, Bin

arXiv.org Machine Learning | Jun-15-2013

The strategy of early stopping is a regularization technique based on choosing a stopping time for an iterative algorithm. Focusing on non-parametric regression in a reproducing kernel Hilbert space, we analyze the early stopping strategy for a form of gradient descent applied to the least-squares loss function. We propose a data-dependent stopping rule that does not involve hold-out or cross-validation data, and we prove upper bounds on the squared error of the resulting function estimate, measured in either the $L^2(P)$ or the $L^2(P_n)$ norm. These upper bounds lead to minimax-optimal rates for various kernel classes, including Sobolev smoothness classes and other forms of reproducing kernel Hilbert spaces. We show through simulation that our stopping rule compares favorably to two other stopping rules, one based on hold-out data and the other based on Stein's unbiased risk estimate. We also establish a tight connection between our early stopping strategy and the solution path of a kernel ridge regression estimator.

artificial intelligence, machine learning, regression, (16 more...)

arXiv.org Machine Learning

1306.3574

Country:

North America > United States > Rhode Island > Providence County > Providence (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(5 more...)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)

Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)

Bach, Francis, Moulines, Eric

arXiv.org Machine Learning | Jun-10-2013

We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which includes machine learning methods based on the minimization of the empirical risk. We focus on problems without strong convexity, for which all previously known algorithms achieve a convergence rate for function values of O(1/n^{1/2}). We consider and analyze two algorithms that achieve a rate of O(1/n) for classical supervised learning problems. For least-squares regression, we show that averaged stochastic gradient descent with constant step-size achieves the desired rate. For logistic regression, this is achieved by a simple novel stochastic gradient algorithm that (a) constructs successive local quadratic approximations of the loss functions, while (b) preserving the same running time complexity as stochastic gradient descent. For these algorithms, we provide a non-asymptotic analysis of the generalization error (in expectation, and also in high probability for least-squares), and run extensive experiments on standard machine learning benchmarks showing that they often outperform existing approaches.
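For the least-squares case mentioned in this abstract, averaged stochastic gradient descent with a constant step size can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the two-dimensional synthetic data, the step size of 0.1, and the names `averaged_sgd_least_squares` and `make_stream` are all chosen for the example.

```python
import random


def averaged_sgd_least_squares(stream, dim, step):
    """Constant-step SGD on the squared loss; returns the running average of the iterates."""
    theta = [0.0] * dim
    theta_bar = [0.0] * dim
    for n, (x, y) in enumerate(stream, start=1):
        residual = sum(t * xi for t, xi in zip(theta, x)) - y
        # Gradient of 0.5 * (theta . x - y)^2 with respect to theta is residual * x.
        theta = [t - step * residual * xi for t, xi in zip(theta, x)]
        # Incremental mean of the iterates theta_1, ..., theta_n.
        theta_bar = [tb + (t - tb) / n for tb, t in zip(theta_bar, theta)]
    return theta_bar


rng = random.Random(0)
true_theta = [1.0, -2.0]  # assumed ground truth for the synthetic stream


def make_stream(n):
    """Yield n synthetic (x, y) pairs with y = true_theta . x + Gaussian noise."""
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        y = sum(t * xi for t, xi in zip(true_theta, x)) + rng.gauss(0.0, 0.5)
        yield x, y


estimate = averaged_sgd_least_squares(make_stream(20_000), dim=2, step=0.1)
print(estimate)
```

The last iterate of constant-step SGD keeps fluctuating around the optimum; returning the average of the iterates is what smooths that fluctuation out.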

artificial intelligence, log 10, machine learning, (17 more...)

arXiv.org Machine Learning

1306.2119

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)
Europe > United Kingdom > England > West Sussex (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Education (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Expectation-maximization for logistic regression

Scott, James G., Sun, Liang

arXiv.org Machine Learning | May-31-2013

We present a family of expectation-maximization (EM) algorithms for binary and negative-binomial logistic regression, drawing a sharp connection with the variational-Bayes algorithm of Jaakkola and Jordan (2000). Indeed, our results allow a version of this variational-Bayes approach to be re-interpreted as a true EM algorithm. We study several interesting features of the algorithm, and of this previously unrecognized connection with variational Bayes. We also generalize the approach to sparsity-promoting priors, and to an online method whose convergence properties are easily established. This latter method compares favorably with stochastic-gradient descent in situations with marked collinearity.

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

1306.0040

Country:

Asia > Middle East > Jordan (0.24)
North America > United States > Texas > Travis County > Austin (0.04)

Genre:

Research Report > New Finding (0.56)
Research Report > Experimental Study (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

A Simulated Annealing Clustering Algorithm Based On Center Perturbation Using Gaussian Mutation

Merendino, Stephen (Louisiana State University in Shreveport) | Celebi, M. Emre (Louisiana State University in Shreveport)

AAAI Conferences | May-19-2013

Clustering, the unsupervised classification of objects into groups, is a widely used technique in exploratory data analysis. The clustering problem is a very complex one, and a popular heuristic for solving it is the Simulated Annealing (SA) algorithm. SA is an approximation algorithm that involves generating a neighborhood solution by perturbing the current solution in a small, yet meaningful way. This new solution is accepted with a probability of 1 if it is quantitatively better than the current solution, and accepted according to the Metropolis criterion otherwise. Cluster quality is measured using the Sum of Squared Error (SSE) criterion. This paper presents an SA algorithm that uses a new type of perturbation to generate solutions. Whereas most SA clustering algorithms perturb data point memberships directly, our algorithm perturbs a randomly chosen center using Gaussian mutation, and then reassigns data points in a nearest neighbor fashion. Experimental results on a diverse collection of data sets demonstrate that our algorithm has comparable effectiveness to other SA algorithms, while being much faster due to its simplicity.
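The loop described in this abstract (perturb one randomly chosen center with Gaussian noise, reassign points to their nearest center, accept improvements outright and worse solutions by the Metropolis criterion) can be sketched as follows. This is a minimal sketch and not the authors' algorithm as published: the geometric cooling schedule, the mutation width `sigma`, the two-dimensional toy data, and the names `sse` and `sa_cluster` are assumptions made for illustration.

```python
import math
import random


def sse(points, centers):
    """Sum of squared distances from each 2-D point to its nearest center."""
    return sum(
        min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
        for p in points
    )


def sa_cluster(points, k, rng, iters=2000, t0=1.0, cooling=0.995, sigma=0.5):
    """Simulated annealing over cluster centers; returns (best_centers, best_sse)."""
    centers = [list(p) for p in rng.sample(points, k)]
    cost = sse(points, centers)
    best_centers, best_cost = [c[:] for c in centers], cost
    temp = t0
    for _ in range(iters):
        cand = [c[:] for c in centers]
        j = rng.randrange(k)  # pick one center at random
        cand[j] = [v + rng.gauss(0.0, sigma) for v in cand[j]]  # Gaussian mutation
        cand_cost = sse(points, cand)  # nearest-center reassignment is implicit here
        delta = cand_cost - cost
        # Accept improvements always; accept worse moves with probability exp(-delta/temp).
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            centers, cost = cand, cand_cost
            if cost < best_cost:
                best_centers, best_cost = [c[:] for c in centers], cost
        temp *= cooling  # geometric cooling (assumed schedule)
    return best_centers, best_cost


# Toy data (an assumption for illustration): two Gaussian blobs in the plane.
data_rng = random.Random(0)
points = [(data_rng.gauss(0.0, 0.5), data_rng.gauss(0.0, 0.5)) for _ in range(50)]
points += [(data_rng.gauss(5.0, 0.5), data_rng.gauss(5.0, 0.5)) for _ in range(50)]

best_centers, best_cost = sa_cluster(points, k=2, rng=random.Random(1))
print(best_centers, best_cost)
```

Because only one center moves per step and memberships follow from nearest-center assignment, each candidate costs a single pass over the data, which is consistent with the simplicity the abstract emphasizes.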

center perturbation, gaussian mutation, simulated annealing clustering algorithm

AAAI Conferences

The Twenty-Sixth International FLAIRS Conference

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)

Accelerated Mini-Batch Stochastic Dual Coordinate Ascent

Shalev-Shwartz, Shai, Zhang, Tong

arXiv.org Machine Learning | May-12-2013

Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov (2007).

artificial intelligence, iteration, machine learning, (16 more...)

arXiv.org Machine Learning

1305.2581

Country:

North America > United States (0.04)
Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients

Schaul, Tom, LeCun, Yann

arXiv.org Artificial Intelligence | Mar-27-2013

Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes the need for tuning, while automatically reducing learning rates over time on stationary problems, and permitting learning rates to grow appropriately in non-stationary tasks. Here, we extend the idea in three directions: addressing proper minibatch parallelization, including reweighted updates for sparse or orthogonal gradients, and improving robustness on non-smooth loss functions, in the process replacing the diagonal Hessian estimation procedure, which may not always be available, with a robust finite-difference approximation. The final algorithm integrates all these components, has linear complexity and is hyper-parameter free.

artificial intelligence, gradient, machine learning, (17 more...)

arXiv.org Artificial Intelligence

1301.3764

Country: Oceania > Tonga (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
