Adaptive Gradient Methods for Constrained Convex Optimization
Alina Ene, Huy L. Nguyen, Adrian Vladu
Gradient methods are a fundamental building block of modern machine learning. Their scalability and small memory footprint make them exceptionally well suited to the massive volumes of data used for present-day learning tasks. While such optimization methods perform very well in practice, one of their major limitations is their inability to converge faster by taking advantage of specific features of the input data. For example, the training data used for classification tasks may exhibit a few very informative features, while all the others have only marginal relevance. Having access to this information a priori would enable practitioners to appropriately tune first-order optimization methods, thus allowing them to train much faster. Lacking this knowledge, one may attempt to reach similar performance by very carefully tuning hyper-parameters, which are all specific to the learning model and input data. This limitation has motivated the development of adaptive methods, which, in the absence of prior knowledge concerning the importance of the various features in the data, adapt their learning rates based on the information acquired in previous iterations. The most notable example is AdaGrad [13], which adaptively modifies the learning rate corresponding to each coordinate in the vector of weights. Following its success, a host of new adaptive methods appeared, including Adam [17], AmsGrad [27], and Shampoo [14], which attained optimal rates for generic online learning tasks.
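To make the per-coordinate adaptation concrete, the following is a minimal sketch of the standard diagonal AdaGrad update, where each coordinate's step size shrinks with the squared gradients accumulated for that coordinate; the function name, the base step size lr, and the eps stabilizer are illustrative choices, not parameters taken from this paper.

```python
import numpy as np

def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    """One diagonal AdaGrad step (illustrative sketch).

    x     : current iterate (weight vector)
    grad  : gradient at x
    accum : running sum of squared gradients, one entry per coordinate
    """
    accum = accum + grad ** 2                     # accumulate squared gradients per coordinate
    x = x - lr * grad / (np.sqrt(accum) + eps)    # rarely-updated coordinates keep larger steps
    return x, accum
```

In this sketch, coordinates that see large or frequent gradients have their effective learning rate decay quickly, while sparse but informative coordinates retain larger steps, which is the behavior the abstract attributes to AdaGrad.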
Aug-16-2020