Memory-Efficient Adaptive Optimization for Large-Scale Learning

Anil, Rohan, Gupta, Vineet, Koren, Tomer, Singer, Yoram

Jan-30-2019–arXiv.org Machine Learning

Adaptive gradient-based optimizers such as AdaGrad and Adam are among the methods of choice in modern machine learning. These methods maintain second-order statistics of each parameter, thus doubling the memory footprint of the optimizer. In behemoth-size applications, this memory overhead restricts the size of the model being used as well as the number of examples in a mini-batch. We describe a novel, simple, and flexible adaptive optimization method with sublinear memory cost that retains the benefits of per-parameter adaptivity while allowing for larger models and mini-batches. We give convergence guarantees for our method and demonstrate its effectiveness in training very large deep models.

algorithm, batch size, sm3, (15 more...)

arXiv.org Machine Learning

Jan-30-2019

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Machine Translation (0.69)
  - Machine Learning
    - Statistical Learning (0.47)
    - Neural Networks (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found