Stochasticity of Deterministic Gradient Descent: Large Learning Rate for Multiscale Objective Function

Feb-14-2020–arXiv.org Machine Learning

Optimization is a central ingredient of machine learning. First-order optimization algorithms are particularly popular for deep learning tasks because they scale to high-dimensional problems: they use gradient, but not higher-order, information about the objective function to iteratively approximate minimizers. Among first-order methods, arguably the most widely used is gradient descent (GD), or rather one of its variants, stochastic gradient descent (SGD). Designed for objective functions that sum a large number of terms, which can originate, for instance, from big data, SGD introduces a randomization mechanism, gradient subsampling, to improve the scalability of GD (e.g., Zhang [2004], Moulines and Bach [2011], Roux et al. [2012]). Consequently, the iteration of SGD, unlike that of GD, is not deterministic even when started from a fixed initial condition.
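
The contrast between the two iterations can be illustrated with a minimal sketch. The toy objective (an average of one-dimensional quadratics), the data values, the step size, and the function names below are all assumptions chosen for illustration and do not come from the paper. GD averages the gradient over every term at each step, whereas SGD uses the gradient of a single randomly sampled term.

```python
import random

# Hypothetical toy objective (an assumption, not from the paper):
#   f(x) = (1/n) * sum_i 0.5 * (x - a_i)^2, minimized at the mean of the a_i.
DATA = [1.0, 2.0, 3.0, 6.0]


def grad_term(x, a):
    """Gradient of a single term 0.5 * (x - a)^2 with respect to x."""
    return x - a


def gd(x0, lr, steps):
    """Gradient descent: each step uses the gradient averaged over all terms."""
    x = x0
    for _ in range(steps):
        full_grad = sum(grad_term(x, a) for a in DATA) / len(DATA)
        x -= lr * full_grad
    return x


def sgd(x0, lr, steps, rng):
    """Stochastic gradient descent: each step uses one randomly sampled term."""
    x = x0
    for _ in range(steps):
        a = rng.choice(DATA)  # gradient subsampling
        x -= lr * grad_term(x, a)
    return x


# Same initial condition in every run.
gd_runs = [gd(0.0, 0.1, 200) for _ in range(3)]
sgd_runs = [sgd(0.0, 0.1, 200, random.Random(seed)) for seed in range(3)]

print("GD: ", gd_runs)   # identical values across runs
print("SGD:", sgd_runs)  # values depend on the random seed
```

In this sketch every GD run from the same starting point returns the same iterate, while the SGD runs differ across seeds, which is the non-determinism the abstract refers to.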

chaos, invariant distribution, potential well

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
