Why Transformers Need Adam: A Hessian Perspective
Neural Information Processing Systems
CNNs, MLPs, and quadratic problems, and find that SGD can perform on par with Adam on problems without block heterogeneity, but performs worse than Adam when the heterogeneity exists.
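To make the quadratic case concrete, here is a minimal sketch of why a single learning rate can struggle when curvature differs sharply across parameter blocks. Everything in it is an illustrative assumption rather than the paper's setup: the two blocks and their eigenvalues are hypothetical, the optimizer is full-batch gradient descent rather than SGD, and a per-block learning rate stands in crudely for the kind of adaptivity Adam provides. It does not implement Adam.

```python
# Toy diagonal quadratic f(x) = 0.5 * sum_i h[i] * x[i]**2.
# Hypothetical setup: two parameter blocks whose Hessian eigenvalues
# differ by two orders of magnitude (a simple form of block heterogeneity).
BLOCKS = {
    "block_a": [1.0, 2.0],      # small curvature
    "block_b": [100.0, 200.0],  # large curvature
}


def loss(h, x):
    """Value of the diagonal quadratic for eigenvalues h at point x."""
    return 0.5 * sum(hi * xi * xi for hi, xi in zip(h, x))


def run_gd(block_lrs, steps=100):
    """Gradient descent with one learning rate per block; returns final loss."""
    total = 0.0
    for name, h in BLOCKS.items():
        lr = block_lrs[name]
        x = [1.0] * len(h)  # same starting point for every coordinate
        for _ in range(steps):
            # The gradient of the quadratic is h[i] * x[i].
            x = [xi - lr * hi * xi for hi, xi in zip(h, x)]
        total += loss(h, x)
    return total


# One shared learning rate 1/L, with L the largest eigenvalue over all blocks.
global_L = max(max(h) for h in BLOCKS.values())
single_lr = {name: 1.0 / global_L for name in BLOCKS}

# One learning rate per block, 1/L_block (an assumed stand-in for adaptivity).
blockwise_lr = {name: 1.0 / max(h) for name, h in BLOCKS.items()}

loss_single = run_gd(single_lr)
loss_blockwise = run_gd(blockwise_lr)
print(f"single lr:     {loss_single:.3e}")
print(f"block-wise lr: {loss_blockwise:.3e}")
```

With the shared rate, the step size is capped by the high-curvature block, so the low-curvature block barely moves in 100 steps; with per-block rates each block converges at a pace set by its own spectrum. If both blocks had the same eigenvalues, the two runs would coincide, which loosely mirrors the homogeneous case described above.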
Feb-18-2026, 15:12:37 GMT
- Country:
  - Asia > China
    - Guangdong Province > Shenzhen (0.04)
    - Hong Kong (0.04)
  - Europe > Germany
    - Bavaria > Upper Bavaria > Munich (0.04)
- Genre:
  - Research Report
    - Experimental Study (0.93)
    - New Finding (0.92)
- Technology: