AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity

arXiv.org Machine LearningSep-25-2012

Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependence, as well. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently nonconvex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing nonconvex programs, we are able to both analyze the statistical error associated with any global optimum, and more surprisingly, to prove that a simple algorithm based on projected gradient descent will converge in polynomial time to a small neighborhood of the set of all global minimizers. On the statistical side, we provide nonasymptotic bounds that hold with high probability for the cases of noisy, missing and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm is guaranteed to converge at a geometric rate to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing close agreement with the predicted scalings.

artificial intelligence, estimator, machine learning, (18 more...)

arXiv.org Machine Learning

doi: 10.1214/12-AOS1018

1109.3714

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Michigan (0.04)

Genre: Research Report (0.64)

Industry: Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

Add feedback

Fast global convergence of gradient methods for high-dimensional statistical recovery

Agarwal, Alekh, Negahban, Sahand N., Wainwright, Martin J.

arXiv.org Machine LearningJul-25-2012

Many statistical $M$-estimators are based on convex optimization problems formed by the combination of a data-dependent loss function with a norm-based regularizer. We analyze the convergence rates of projected gradient and composite gradient methods for solving such problems, working within a high-dimensional framework that allows the data dimension $\pdim$ to grow with (and possibly exceed) the sample size $\numobs$. This high-dimensional structure precludes the usual global assumptions---namely, strong convexity and smoothness conditions---that underlie much of classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that projected gradient descent has a globally geometric rate of convergence up to the \emph{statistical precision} of the model, meaning the typical distance between the true unknown parameter $\theta^*$ and an optimal solution $\hat{\theta}$. This result is substantially sharper than previous convergence results, which yielded sublinear convergence, or linear convergence only up to the noise level. Our analysis applies to a wide range of $M$-estimators and statistical models, including sparse linear regression using Lasso ($\ell_1$-regularized regression); group Lasso for block sparsity; log-linear models with regularization; low-rank matrix recovery using nuclear norm regularization; and matrix decomposition. Overall, our analysis reveals interesting connections between statistical precision and computational efficiency in high-dimensional estimation.

artificial intelligence, inequality, machine learning, (19 more...)

arXiv.org Machine Learning

1104.4824

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Add feedback

Adaptive Step-Size for Online Temporal Difference Learning

Dabney, William (University of Massachusetts Amherst) | Barto, Andrew G (University of Massachusetts Amherst)

AAAI ConferencesJul-21-2012

The step-size, often denoted as α, is a key parameter for most incremental learning algorithms. Its importance is especially pronounced when performing online temporal difference (TD) learning with function approximation. Several methods have been developed to adapt the step-size online. These range from straightforward back-off strategies to adaptive algorithms based on gradient descent. We derive an adaptive upper bound on the step-size parameter to guarantee that online TD learning with linear function approximation will not diverge. We then empirically evaluate algorithms using this upper bound as a heuristic for adapting the step-size parameter online. We compare performance with related work including HL(λ) and Autostep. Our results show that this adaptive upper bound heuristic out-performs all existing methods without requiring any meta-parameters. This effectively eliminates the need to tune the learning rate of temporal difference learning with linear function approximation.

artificial intelligence, machine learning, reinforcement learning, (20 more...)

AAAI Conferences

Twenty-Sixth AAAI Conference on Artificial Intelligence

Country: North America > United States > Massachusetts > Hampshire County > Amherst (0.14)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback

Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions

Agarwal, Alekh, Negahban, Sahand, Wainwright, Martin J.

arXiv.org Machine LearningJul-18-2012

Stochastic optimization algorithms have many desirable features for large-scale machine learning, and accordingly have been the focus of renewed and intensive study in the last several years (e.g., see the papers [26, 4, 10, 30] and references therein). The empirical efficiency of these methods is backed with strong theoretical guarantees, providing sharp bounds on their convergence rates. These convergence rates are known to depend on the structure of the underlying objective function, with faster rates being possible for objective functions that are smooth and/or (strongly) convex, or optima that have desirable features such as sparsity. More precisely, for an objective function that is strongly convex, stochastic gradient descent enjoys a convergence rate ranging from O(1/T), when features vectors are extremely sparse, to O(d/T) when feature vectors are dense [11, 19, 12]. Such results are of significant interest, because the strong convexity condition is satisfied for many common machine learning problems, including boosting, least squares regression, support vector machines and generalized linear models, among other examples. A complementary type of condition is that of sparsity, either exact or approximate, in the optimal solution.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

1207.4421

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > California > Alameda County > Berkeley (0.14)
North America > United States > Rhode Island > Providence County > Providence (0.04)
North America > United States > New York (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.54)

Add feedback

Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring

Ahn, Sungjin, Korattikara, Anoop, Welling, Max

arXiv.org Machine LearningJun-27-2012

In this paper we address the following question: "Can we approximately sample from a Bayesian posterior distribution if we are only allowed to touch a small mini-batch of data-items for every sample we generate?". An algorithm based on the Langevin equation with stochastic gradients (SGLD) was previously proposed to solve this, but its mixing rate was slow. By leveraging the Bayesian Central Limit Theorem, we extend the SGLD algorithm so that at high mixing rates it will sample from a normal approximation of the posterior, while for slow mixing rates it will mimic the behavior of SGLD with a pre-conditioner matrix. As a bonus, the proposed algorithm is reminiscent of Fisher scoring (with stochastic gradients) and as such an efficient optimizer during burn-in.

algorithm, artificial intelligence, machine learning, (13 more...)

arXiv.org Machine Learning

1206.638

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > California > Orange County > Irvine (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Fast Bounded Online Gradient Descent Algorithms for Scalable Kernel-Based Online Learning

Zhao, Peilin, Wang, Jialei, Wu, Pengcheng, Jin, Rong, Hoi, Steven C. H.

arXiv.org Machine LearningJun-18-2012

Kernel-based online learning has often shown state-of-the-art performance for many online learning tasks. It, however, suffers from a major shortcoming, that is, the unbounded number of support vectors, making it non-scalable and unsuitable for applications with large-scale datasets. In this work, we study the problem of bounded kernel-based online learning that aims to constrain the number of support vectors by a predefined budget. Although several algorithms have been proposed in literature, they are neither computationally efficient due to their intensive budget maintenance strategy nor effective due to the use of simple Perceptron algorithm. To overcome these limitations, we propose a framework for bounded kernel-based online learning based on an online gradient descent approach. We propose two efficient algorithms of bounded online gradient descent (BOGD) for scalable kernel-based online learning: (i) BOGD by maintaining support vectors using uniform sampling, and (ii) BOGD++ by maintaining support vectors using non-uniform sampling. We present theoretical analysis of regret bound for both algorithms, and found promising empirical performance in terms of both efficacy and efficiency by comparing them to several well-known algorithms for bounded kernel-based online learning on large-scale datasets.

algorithm, artificial intelligence, machine learning, (11 more...)

arXiv.org Machine Learning

1206.4633

Country:

North America > United States > Michigan (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.83)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.57)

Add feedback

BPR: Bayesian Personalized Ranking from Implicit Feedback

Rendle, Steffen, Freudenthaler, Christoph, Gantner, Zeno, Schmidt-Thieme, Lars

arXiv.org Machine LearningMay-9-2012

Item recommendation is the task of predicting a personalized ranking on a set of items (e.g. websites, movies, products). In this paper, we investigate the most common scenario with implicit feedback (e.g. clicks, purchases). There are many methods for item recommendation from implicit feedback like matrix factorization (MF) or adaptive knearest-neighbor (kNN). Even though these methods are designed for the item prediction task of personalized ranking, none of them is directly optimized for ranking. In this paper we present a generic optimization criterion BPR-Opt for personalized ranking that is the maximum posterior estimator derived from a Bayesian analysis of the problem. We also provide a generic learning algorithm for optimizing models with respect to BPR-Opt. The learning method is based on stochastic gradient descent with bootstrap sampling. We show how to apply our method to two state-of-the-art recommender models: matrix factorization and adaptive kNN. Our experiments indicate that for the task of personalized ranking our optimization method outperforms the standard learning techniques for MF and kNN. The results show the importance of optimizing models for the right criterion.

artificial intelligence, machine learning, matrix factorization, (13 more...)

arXiv.org Machine Learning

1205.2618

Country:

North America > United States > New York > New York County > New York City (0.05)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Puerto Rico > San Juan > San Juan (0.04)
Europe > Germany (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Media > Film (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.58)

Add feedback

The Natural Gradient by Analogy to Signal Whitening, and Recipes and Tricks for its Use

Sohl-Dickstein, Jascha

arXiv.org Machine LearningMay-8-2012

The natural gradient, as introduced by [Amari, 1987], allows for more efficient gradient descent by removing dependencies and biases inherent in a function's parameterization. Several papers present the topic thoroughly and precisely [Amari, 1987, Amari, 1998, Amari and Nagaoka, 2000, Theis, 2005, Amari, 2010]. It remains a very difficult idea to get your head around however. The intent of this note is to provide simple intuition for the natural gradient and its uses. We review how an ill conditioned parameter space can undermine learning, introduce the natural gradient by analogy to the more widely understood concept of signal whitening, and present tricks and specific prescriptions for applying the natural gradient to learning problems.

artificial intelligence, gradient, machine learning, (17 more...)

arXiv.org Machine Learning

1205.1828

Country:

Asia > Middle East > Jordan (0.05)
North America > United States > California (0.04)
Asia > China (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.39)

Add feedback

Fast projections onto mixed-norm balls with applications

Sra, Suvrit

arXiv.org Machine LearningApr-6-2012

Joint sparsity offers powerful structural cues for feature selection, especially for variables that are expected to demonstrate a "grouped" behavior. Such behavior is commonly modeled via group-lasso, multitask lasso, and related methods where feature selection is effected via mixed-norms. Several mixed-norm based sparse models have received substantial attention, and for some cases efficient algorithms are also available. Surprisingly, several constrained sparse models seem to be lacking scalable algorithms. We address this deficiency by presenting batch and online (stochastic-gradient) optimization methods, both of which rely on efficient projections onto mixed-norm balls. We illustrate our methods by applying them to the multitask lasso. We conclude by mentioning some open problems.

artificial intelligence, machine learning, projection, (16 more...)

arXiv.org Machine Learning

1204.1437

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
North America > United States > Wisconsin (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(5 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)

Add feedback

Combining Spatial and Telemetric Features for Learning Animal Movement Models

Kapicioglu, Berk, Schapire, Robert E., Wikelski, Martin, Broderick, Tamara

arXiv.org Machine LearningMar-15-2012

We introduce a new graphical model for tracking radio-tagged animals and learning their movement patterns. The model provides a principled way to combine radio telemetry data with an arbitrary set of userdefined, spatial features. We describe an efficient stochastic gradient algorithm for fitting model parameters to data and demonstrate its effectiveness via asymptotic analysis and synthetic experiments. We also apply our model to real datasets, and show that it outperforms the most popular radio telemetry software package used in ecology. We conclude that integration of different data sources under a single statistical framework, coupled with appropriate parameter and state estimation procedures, produces both accurate location estimates and an interpretable statistical model of animal movement.

algorithm, artificial intelligence, machine learning, (14 more...)

arXiv.org Machine Learning

1203.3486

Country: North America > United States > Colorado (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.37)

Add feedback