Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects
Ke Liang Xiao, Noah Marshall, Atish Agarwala, Elliot Paquette
The success of deep learning has been driven by the effectiveness of relatively simple stochastic optimization algorithms. Stochastic gradient descent (SGD) with momentum can be used to train models like ResNet50 with minimal hyperparameter tuning. The workhorse of modern machine learning is Adam, which was designed to approximate preconditioning through a diagonal, online estimate of the Fisher information matrix (Kingma, 2014). Additional hypotheses for the success of Adam include its ability to keep parameter updates balanced across layers and its potential noise-mitigating effects (Zhang et al., 2020; 2024). A quantitative, theoretical understanding of Adam and its variants is hindered by their complexity: while the multiple exponential moving averages are easy to implement, they complicate analysis. The practical desire for simpler, more efficient learning algorithms and the theoretical desire for more tractable models have led to a resurgence in the study of signSGD.
Nov-18-2024
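To make the algorithms named in the abstract concrete, here is a minimal sketch comparing one run each of SGD, signSGD, and an Adam-style diagonally preconditioned update on a toy least-squares problem. The loss, problem sizes, batch size, learning rates, and beta/epsilon values are illustrative assumptions for this sketch, not the paper's experimental setup.

```python
import numpy as np

# Toy least-squares problem (illustrative sizes, not from the paper).
rng = np.random.default_rng(0)
d, n = 50, 200
A = rng.normal(size=(n, d)) / np.sqrt(d)   # data matrix
x_star = rng.normal(size=d)                # ground-truth parameters
y = A @ x_star


def stochastic_gradient(x, batch_size=8):
    """Unbiased mini-batch estimate of the gradient of 0.5 * ||A x - y||^2 / n."""
    idx = rng.choice(n, size=batch_size, replace=False)
    return A[idx].T @ (A[idx] @ x - y) / batch_size


x_sgd = np.zeros(d)
x_sign = np.zeros(d)
x_adam = np.zeros(d)
m = np.zeros(d)   # first-moment EMA (Adam)
v = np.zeros(d)   # second-moment EMA (Adam): a diagonal preconditioner estimate

lr_sgd, lr_sign, lr_adam = 0.1, 0.01, 0.05   # illustrative step sizes
beta1, beta2, eps = 0.9, 0.999, 1e-8

for t in range(1, 501):
    # Plain SGD: step along the stochastic gradient.
    g = stochastic_gradient(x_sgd)
    x_sgd -= lr_sgd * g

    # signSGD: keep only the sign of each gradient coordinate.
    g = stochastic_gradient(x_sign)
    x_sign -= lr_sign * np.sign(g)

    # Adam: EMAs of the gradient and its square give a diagonally
    # preconditioned update.
    g = stochastic_gradient(x_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
    x_adam -= lr_adam * m_hat / (np.sqrt(v_hat) + eps)

for name, x in [("SGD", x_sgd), ("signSGD", x_sign), ("Adam", x_adam)]:
    print(f"{name:8s} risk = {0.5 * np.mean((A @ x - y) ** 2):.4f}")
```

The sketch only illustrates the three update rules; the paper's contribution concerns exact high-dimensional risk curves for signSGD, which this toy run does not reproduce.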