adagrad
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Denmark (0.04)
The Implicit Bias of AdaGrad on Separable Data
We study the implicit bias of AdaGrad on separable linear classification problems. We show that AdaGrad converges to a direction that can be characterized as the solution of a quadratic optimization problem with the same feasible set as the hard SVM problem. We also give a discussion about how different choices of the hyperparameters of AdaGrad may impact this direction. This provides a deeper understanding of why adaptive methods do not seem to have the generalization ability as good as gradient descent does in practice.
The Marginal Value of Adaptive Gradient Methods in Machine Learning
Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.
Online to Offline Conversions, Universality and Adaptive Minibatch Sizes
Over the past years data adaptiveness has proven to be crucial to the success of learning algorithms. The objective function underlying "big data" applications often demonstrates intricate structure: the scale and smoothness are often unknown and may change substantially in between different regions/directions, [
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- North America > United States (0.14)
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- (2 more...)
- North America > United States (0.14)
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- (2 more...)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Russia (0.04)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > Canada (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Asia > Afghanistan > Parwan Province > Charikar (0.04)