Why Does Sharpness-Aware Minimization Generalize Better Than SGD?
