Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning
Pan Zhou