Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning (Supplementary File)
Neural Information Processing Systems
It is structured as follows. Appendix C summarizes the notations throughout this document and also provides the auxiliary theories and lemmas for subsequent analysis, whose proofs are deferred to Appendix E. Then Appendix D gives the proofs of the main results in Sec. Then we introduce the two types of randomness in the SDE of ADAM. Finally, we run experiments to investigate the validity of the constructed SDEs of ADAM and SGD. Here we further investigate the second-order moment of the gradient noise.
Feb-11-2026, 02:37:24 GMT