AITopics | Stern, Mitchell

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Wilson, Ashia C., Roelofs, Rebecca, Stern, Mitchell, Srebro, Nati, Recht, Benjamin

Neural Information Processing SystemsDec-31-2017

Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.

adaptive method, deep learning, neural network, (17 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.28)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Stochastic Cubic Regularization for Fast Nonconvex Optimization

Tripuraneni, Nilesh, Stern, Mitchell, Jin, Chi, Regier, Jeffrey, Jordan, Michael I.

arXiv.org Machine LearningDec-5-2017

This paper proposes a stochastic variant of a classic algorithm---the cubic-regularized Newton method [Nesterov and Polyak 2006]. The proposed algorithm efficiently escapes saddle points and finds approximate local minima for general smooth, nonconvex functions in only $\mathcal{\tilde{O}}(\epsilon^{-3.5})$ stochastic gradient and stochastic Hessian-vector product evaluations. The latter can be computed as efficiently as stochastic gradients. This improves upon the $\mathcal{\tilde{O}}(\epsilon^{-4})$ rate of stochastic gradient descent. Our rate matches the best-known result for finding local minima without requiring any delicate acceleration or variance-reduction techniques.

artificial intelligence, machine learning, ɛ-second-order stationary point, (12 more...)

arXiv.org Machine Learning

1711.02838

Country: North America > United States > California (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.99)

Add feedback

Kernel Feature Selection via Conditional Covariance Minimization

Chen, Jianbo, Stern, Mitchell, Wainwright, Martin J., Jordan, Michael I.

arXiv.org Machine LearningJul-4-2017

We propose a framework for feature selection that employs kernel-based measures of independence to find a subset of covariates that is maximally predictive of the response. Building on past work in kernel dimension reduction, we formulate our approach as a constrained optimization problem involving the trace of the conditional covariance operator, and additionally provide some consistency results. We then demonstrate on a variety of synthetic and real data sets that our method compares favorably with other state-of-the-art algorithms.

feature selection, optimization problem, survey article, (15 more...)

arXiv.org Machine Learning

1707.01164

Country: North America > United States > California (0.28)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)

Add feedback

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Wilson, Ashia C., Roelofs, Rebecca, Stern, Mitchell, Srebro, Nathan, Recht, Benjamin

arXiv.org Machine LearningMay-23-2017

Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.

adaptive method, deep learning, neural network, (18 more...)

arXiv.org Machine Learning

1705.08292

Country:

North America > United States > Texas (0.14)
North America > United States > California (0.14)

Genre: Research Report > New Finding (0.48)

Industry: Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Abstract Syntax Networks for Code Generation and Semantic Parsing

Rabinovich, Maxim, Stern, Mitchell, Klein, Dan

arXiv.org Machine LearningApr-25-2017

Tasks like code generation and semantic parsing require mapping unstructured (or partially structured) inputs to well-formed, executable outputs. We introduce abstract syntax networks, a modeling framework for these problems. The outputs are represented as abstract syntax trees (ASTs) and constructed by a decoder with a dynamically-determined modular structure paralleling the structure of the output tree. On the benchmark Hearthstone dataset for code generation, our model obtains 79.2 BLEU and 22.7% exact match accuracy, compared to previous state-of-the-art values of 67.1 and 6.1%. Furthermore, we perform competitively on the Atis, Jobs, and Geo semantic parsing datasets with no task-specific engineering.

deep learning, module, neural network, (19 more...)

arXiv.org Machine Learning

1704.07535

Country: