Why Transformers Need Adam: A Hessian Perspective

Neural Information Processing Systems 

CNNs, MLPs, and quadratic problems, and find that SGD can perform on par with Adam on problems without block heterogeneity, but performs worse than Adam when block heterogeneity is present.
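To make the quadratic case concrete, the following is a minimal toy sketch, not the paper's experimental setup. It assumes a diagonal Hessian partitioned into blocks and compares gradient descent with one global learning rate against gradient descent with one learning rate per block. The per-block variant is only a crude stand-in for an adaptive method such as Adam, not Adam itself. The names `run_gd`, `homogeneous`, and `heterogeneous`, and the chosen eigenvalues, are hypothetical.

```python
def run_gd(blocks, steps, blockwise):
    """Run gradient descent on f(x) = 0.5 * sum_i h_i * x_i**2 from x_i = 1.

    `blocks` is a list of lists of positive eigenvalues h_i, one list per
    Hessian block. The Hessian is assumed diagonal, so coordinates decouple.
    Returns the final loss.
    """
    # A single learning rate must be safe for the largest eigenvalue overall.
    global_lr = 1.0 / max(h for block in blocks for h in block)
    loss = 0.0
    for block in blocks:
        # Block-wise variant: each block only has to respect its own largest
        # eigenvalue (an assumed proxy for adaptivity, not Adam).
        lr = 1.0 / max(block) if blockwise else global_lr
        for h in block:
            x = 1.0
            for _ in range(steps):
                x -= lr * h * x  # gradient of 0.5 * h * x**2 is h * x
            loss += 0.5 * h * x * x
    return loss


# Every block has the same spectrum: no block heterogeneity.
homogeneous = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
# Spectra differ by orders of magnitude across blocks: block heterogeneity.
heterogeneous = [[1.0, 2.0], [100.0, 200.0], [10000.0, 20000.0]]

steps = 100
single_hom = run_gd(homogeneous, steps, blockwise=False)
block_hom = run_gd(homogeneous, steps, blockwise=True)
single_het = run_gd(heterogeneous, steps, blockwise=False)
block_het = run_gd(heterogeneous, steps, blockwise=True)

print("homogeneous:   single lr", single_hom, "| block-wise lr", block_hom)
print("heterogeneous: single lr", single_het, "| block-wise lr", block_het)
```

In this sketch the two variants coincide on the homogeneous problem, because every block calls for the same learning rate. On the heterogeneous problem the single learning rate is dictated by the stiffest block, so the blocks with small eigenvalues make little progress in the same number of steps.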