Appendix: On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them

Zeke Xie

Neural Information Processing Systems 

Obviously, we have η < in practice. The proof is now complete.

Let SGD optimize L for t + 1 iterations. Introducing the derived conditions Eqs. (12)-(16) for

The batch size is set to 128 for both CIFAR-10 and CIFAR-100. Wherever error bars are shown, each experiment was repeated three times.
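The repetition protocol above can be illustrated with a minimal sketch of how three repeated runs might be reduced to a mean and an error bar. The text does not say how the error bars are computed, so the use of the sample standard deviation is an assumption, as are the names `summarize_runs`, `BATCH_SIZE`, and `NUM_REPEATS`; the input values are placeholders, not results from the paper.

```python
import statistics

BATCH_SIZE = 128   # stated in the text for both CIFAR-10 and CIFAR-100
NUM_REPEATS = 3    # each experiment is repeated three times


def summarize_runs(test_errors):
    """Return (mean, sample standard deviation) over repeated runs.

    Assumption: the error bar is the sample standard deviation;
    the source does not specify the statistic that is plotted.
    """
    if len(test_errors) != NUM_REPEATS:
        raise ValueError(f"expected {NUM_REPEATS} runs, got {len(test_errors)}")
    mean = statistics.mean(test_errors)
    std = statistics.stdev(test_errors)  # sample (n - 1) standard deviation
    return mean, std


# Hypothetical placeholder test errors (%), one per repeated run.
mean, std = summarize_runs([5.0, 5.5, 6.0])
print(f"test error: {mean:.2f} +/- {std:.2f}")
```

With only three runs the sample standard deviation is a coarse estimate, so a reader reproducing the plots may prefer to report the minimum and maximum alongside it.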