Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers
