On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions

Neural Information Processing Systems 

In this paper, we study Adam in non-convex smooth scenarios with potentially unbounded gradients and affine variance noise. We consider a general noise model that covers affine variance noise, bounded noise, and sub-Gaussian noise. We show that Adam with a specific hyper-parameter setup can find a stationary point at an O(1/√T) rate with high probability under this general noise model, where T denotes the total number of iterations, matching the lower bound for stochastic first-order algorithms up to logarithmic factors.
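
As context for the noise model, affine variance noise is commonly formalized as E[||g_t − ∇f(x_t)||²] ≤ σ₀² + σ₁²·||∇f(x_t)||², which recovers the bounded-noise case when σ₁ = 0. The sketch below shows the standard Adam update that such analyses concern; the hyper-parameters (learning rate, beta1, beta2, eps) are generic defaults rather than the paper's specific setup, and the quadratic test objective with Gaussian gradient noise is purely illustrative.

```python
# Minimal sketch of the standard Adam update (generic defaults, not the
# paper's specific hyper-parameter setup).
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update from a stochastic gradient `grad` at iteration t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Illustrative run: T iterations on noisy gradients of f(theta) = 0.5 * ||theta||^2,
# where the additive Gaussian noise is a special case of the general noise model.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 1001):
    noisy_grad = theta + rng.normal(scale=0.1, size=theta.shape)  # stochastic gradient
    theta, m, v = adam_step(theta, noisy_grad, m, v, t)
```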