Appendix: On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them

Neural Information Processing Systems 

Suppose we have a non-zero solution θ which is a stationary point of f(θ,t) at t-th step and SGD finds θt = θ at t-th step. Theorem 2.2 of Shapiro and Wardi [9] told us that the learning rate should be small enough for convergence. Obviously, we have η < in practice. As ηt = ηt+1 does not hold, SGD cannot converging to any non-zero stationary point. The proof is now complete.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found