Appendix: On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them

Apr-24-2026, 06:50:14 GMT–Neural Information Processing Systems

Suppose we have a non-zero solution θ which is a stationary point of f(θ,t) at t-th step and SGD finds θt = θ at t-th step. Theorem 2.2 of Shapiro and Wardi [9] told us that the learning rate should be small enough for convergence. Obviously, we have η < in practice. As ηt = ηt+1 does not hold, SGD cannot converging to any non-zero stationary point. The proof is now complete.

artificial intelligence, deep learning, machine learning, (17 more...)

Neural Information Processing Systems

Apr-24-2026, 06:50:14 GMT

Conferences PDF

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Duplicate Docs Excel Report

Title
040d3b6af368bf71f952c18da5713b48-Supplemental-Conference.pdf

Similar Docs Excel Report more

Title	Similarity	Source
None found