Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects
Ke Liang Xiao, Noah Marshall, Atish Agarwala, Elliot Paquette
The success of deep learning has been driven by the effectiveness of relatively simple stochastic optimization algorithms. Stochastic gradient descent (SGD) with momentum can be used to train models like ResNet50 with minimal hyperparameter tuning. The workhorse of modern machine learning is Adam, which was designed to approximate preconditioning through a diagonal, online estimate of the Fisher information matrix (Kingma, 2014). Additional hypotheses for the success of Adam include its ability to keep parameter updates balanced across layers and its potential noise-mitigating effects (Zhang et al., 2020; 2024). A quantitative, theoretical understanding of Adam and its variants is hindered by their complexity: while the multiple exponential moving averages are easy to implement, they complicate analysis. The practical desire for simpler, more efficient learning algorithms and the theoretical desire for more tractable models have led to a resurgence in the study of signSGD.
Nov-18-2024
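To make the algorithms named in the abstract concrete, here is a minimal sketch comparing one run each of SGD, signSGD, and an Adam-style diagonally preconditioned update on a toy least-squares problem. The loss, problem sizes, batch size, learning rates, and beta/epsilon values are illustrative assumptions for this sketch, not the paper's experimental setup.

```python
import numpy as np

# Toy least-squares problem (illustrative sizes, not from the paper).
rng = np.random.default_rng(0)
d, n = 50, 200
A = rng.normal(size=(n, d)) / np.sqrt(d)   # data matrix
x_star = rng.normal(size=d)                # ground-truth parameters
y = A @ x_star


def stochastic_gradient(x, batch_size=8):
    """Unbiased mini-batch estimate of the gradient of 0.5 * ||A x - y||^2 / n."""
    idx = rng.choice(n, size=batch_size, replace=False)
    return A[idx].T @ (A[idx] @ x - y) / batch_size


x_sgd = np.zeros(d)
x_sign = np.zeros(d)
x_adam = np.zeros(d)
m = np.zeros(d)   # first-moment EMA (Adam)
v = np.zeros(d)   # second-moment EMA (Adam): a diagonal preconditioner estimate

lr_sgd, lr_sign, lr_adam = 0.1, 0.01, 0.05   # illustrative step sizes
beta1, beta2, eps = 0.9, 0.999, 1e-8

for t in range(1, 501):
    # Plain SGD: step along the stochastic gradient.
    g = stochastic_gradient(x_sgd)
    x_sgd -= lr_sgd * g

    # signSGD: keep only the sign of each gradient coordinate.
    g = stochastic_gradient(x_sign)
    x_sign -= lr_sign * np.sign(g)

    # Adam: EMAs of the gradient and its square give a diagonally
    # preconditioned update.
    g = stochastic_gradient(x_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
    x_adam -= lr_adam * m_hat / (np.sqrt(v_hat) + eps)

for name, x in [("SGD", x_sgd), ("signSGD", x_sign), ("Adam", x_adam)]:
    print(f"{name:8s} risk = {0.5 * np.mean((A @ x - y) ** 2):.4f}")
```

The sketch only illustrates the three update rules; the paper's contribution concerns exact high-dimensional risk curves for signSGD, which this toy run does not reproduce.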