Review for NeurIPS paper: Why are Adaptive Methods Good for Attention Models?
Summary and Contributions: The paper studies the behavior of SGD, Adam, and SGD with clipping on stochastic optimization problems with heavy-tailed stochastic gradients. First, the authors empirically establish that Adam outperforms SGD on problems with heavy-tailed stochastic gradients. Next, they derive convergence guarantees for clipped SGD in two settings: smooth non-convex problems, under the assumption of a uniformly bounded central moment of order \alpha \in (1,2] of the stochastic gradient, and non-smooth strongly convex problems (the authors require L-smoothness in the statement of the theorem but do not use it in the proof), under the assumption of a uniformly bounded moment of order \alpha \in (1,2] of the stochastic gradient. Interestingly, in these settings SGD can diverge, which fits the empirical evidence that methods with clipping (or its adaptive variants) work better than SGD in the presence of heavy-tailed noise. Furthermore, the paper derives lower bounds for these settings, implying the optimality of clipped SGD.
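To make the clipping mechanism concrete, below is a minimal Python sketch of the clipped-SGD update under heavy-tailed noise. The quadratic objective, the Pareto noise model, and the values of eta and tau are illustrative assumptions for this review, not the paper's experimental setup or step-size schedules.

```python
import numpy as np

rng = np.random.default_rng(0)

def stoch_grad(x):
    """Gradient of f(x) = ||x||^2 / 2 corrupted by heavy-tailed noise.

    Symmetric Pareto noise with shape 1.5 has finite moments only of
    order alpha < 1.5, mimicking the paper's alpha in (1, 2] regime.
    (Illustrative choice, not the paper's noise model.)
    """
    noise = rng.pareto(1.5, size=x.shape) * rng.choice([-1.0, 1.0], size=x.shape)
    return x + noise

def clipped_sgd_step(x, eta=0.1, tau=1.0):
    """One clipped-SGD step: x <- x - eta * min(1, tau / ||g||) * g.

    eta (step size) and tau (clipping threshold) are placeholder
    values, not the tuned schedules from the paper's analysis.
    """
    g = stoch_grad(x)
    g_norm = np.linalg.norm(g)
    if g_norm > tau:              # rescale only when the norm exceeds tau
        g = g * (tau / g_norm)
    return x - eta * g

x = np.ones(10)
for _ in range(1000):
    x = clipped_sgd_step(x)
print(np.linalg.norm(x))          # iterates stay bounded despite heavy tails
```

The clipping step caps the norm of each stochastic gradient at tau, which is what prevents the rare but extreme heavy-tailed gradients from destabilizing the iterates, in contrast to plain SGD, which can diverge in this regime.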