Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning
