On the Distributional Properties of Adaptive Gradients
However, not much is known about the mathematical and statistical properties of this family of methods. This work aims at providing a series of theoretical analyses of its statistical properties, justified by experiments. In particular, we show that when the underlying gradient obeys a normal distribution, the variance of the magnitude of the update is an increasing and bounded function of time and does not diverge. This work suggests that the divergence of variance is not the cause of the need for warm-up of the Adam optimizer, contrary to what is believed in the current literature.

In this work, we take the first step toward studying a rather fundamental problem in the study of adaptive gradients: we propose to study the distributional properties of the update in the adaptive gradient method. The most closely related previous work is [Liu et al., 2019]. The difference is that this work goes much deeper into the details of the theoretical analysis and contradicts the results in [Liu et al., 2019]. The main contributions of this work are the following: (1) We prove that the variance of the adaptive gradient method is always finite (Proposition 1), which contradicts the result in [Liu et al., 2019]; this proof does not make any assumption regarding the distribution of the gradient.
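As an illustration of the boundedness claim, the following Monte Carlo sketch tracks the empirical variance of the magnitude of the Adam update when the gradients are i.i.d. standard normal. The update rule in the comments is the standard Adam recursion of Kingma and Ba; the hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8), the trajectory length, and the helper name adam_update_variance are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def adam_update_variance(T=500, n_runs=100_000,
                         beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Empirical Var(|m_hat_t / (sqrt(v_hat_t) + eps)|) at each step t,
    estimated over n_runs independent scalar gradient trajectories.

    Standard Adam recursion (Kingma & Ba):
        m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
        v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
        m_hat_t = m_t / (1 - beta1^t),  v_hat_t = v_t / (1 - beta2^t)
        update_t = m_hat_t / (sqrt(v_hat_t) + eps)
    """
    rng = np.random.default_rng(seed)
    m = np.zeros(n_runs)
    v = np.zeros(n_runs)
    var_t = np.empty(T)
    for t in range(1, T + 1):
        g = rng.standard_normal(n_runs)       # g_t ~ N(0, 1), i.i.d. across steps
        m = beta1 * m + (1 - beta1) * g       # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * g**2    # second-moment estimate
        mag = np.abs((m / (1 - beta1**t)) /
                     (np.sqrt(v / (1 - beta2**t)) + eps))
        var_t[t - 1] = mag.var()              # empirical variance of |update_t|
    return var_t

var_t = adam_update_variance()
for t in (1, 10, 100, 500):
    print(f"t = {t:4d}   Var(|update|) = {var_t[t - 1]:.4f}")
```

Under these assumptions, the printed variance typically starts near zero at t = 1 (where the bias-corrected update is g_1 / |g_1|, with magnitude close to 1) and rises monotonically toward a finite plateau, consistent with the bounded, non-divergent behavior described above.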