Stochasticity of Deterministic Gradient Descent: Large Learning Rate for Multiscale Objective Function
Optimization is a central ingredient of machine learning. First-order optimization algorithms are particularly popular for deep learning tasks because they scale to high-dimensional problems: they use gradient information, but not higher-order information, of the objective function to iteratively approximate minimizers. Among first-order methods, arguably the most widely used is the gradient descent method (GD), or rather one of its variants, the stochastic gradient descent method (SGD). Designed for objective functions that sum a large number of terms, which can originate, for instance, from big data, SGD introduces a randomization mechanism, gradient subsampling, to improve the scalability of GD (e.g., Zhang [2004], Moulines and Bach [2011], Roux et al. [2012]). Consequently, the iteration of SGD, unlike that of GD, is not deterministic even when started from a fixed initial condition.
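The contrast between the two iterations can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy objective f(x) = (1/n) Σ ½(x − a_i)², the data values, the learning rate, and the function names `gd` and `sgd`. GD uses the full gradient at every step, so a fixed starting point always yields the same trajectory; SGD subsamples the terms of the sum, so its output depends on the random draws.

```python
import random

# Hypothetical toy data: the objective is f(x) = (1/n) * sum_i 0.5 * (x - a_i)^2,
# whose minimizer is the mean of DATA (2.5 here).
DATA = [1.0, 2.0, 3.0, 4.0]


def full_gradient(x, data=DATA):
    """Gradient of the full objective: the average of (x - a_i) over all terms."""
    return sum(x - a for a in data) / len(data)


def gd(x0, lr, steps):
    """Plain gradient descent: deterministic given x0, lr, and steps."""
    x = x0
    for _ in range(steps):
        x -= lr * full_gradient(x)
    return x


def sgd(x0, lr, steps, rng, batch_size=1):
    """SGD with gradient subsampling: each step uses a random subset of terms."""
    x = x0
    for _ in range(steps):
        batch = rng.sample(DATA, batch_size)  # sample terms without replacement
        x -= lr * sum(x - a for a in batch) / batch_size
    return x


if __name__ == "__main__":
    # Two GD runs from the same initial condition produce the same result.
    print(gd(0.0, 0.1, 50), gd(0.0, 0.1, 50))
    # SGD runs from the same initial condition differ across random seeds.
    print([sgd(0.0, 0.1, 50, random.Random(seed)) for seed in range(3)])
```

In this sketch the randomness of SGD comes entirely from the `rng` argument; passing the same seeded `random.Random` instance state reproduces a run, which is a convenience of the example rather than a property claimed by the source.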
Feb-14-2020