Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Neural Information Processing Systems
Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks.
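As a hedged illustration of what "heavy-tailed class imbalance" means here (not code from the paper): token frequencies in language data roughly follow a Zipfian law, so a handful of classes dominate the training signal while most classes are rare. The sketch below samples synthetic "tokens" with probability proportional to 1/rank and measures how concentrated the class counts are; the vocabulary size and sample count are arbitrary choices for the demonstration.

```python
import collections
import random

random.seed(0)

# Assumed illustrative setup: 1000 classes with Zipfian (1/rank) weights.
vocab_size = 1000
weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
tokens = random.choices(range(vocab_size), weights=weights, k=100_000)

# Count occurrences per class and sort from most to least frequent.
counts = collections.Counter(tokens)
sorted_counts = sorted(counts.values(), reverse=True)

# Heavy tail: the few most frequent classes cover a large share of all
# tokens, while the vast majority of classes are individually rare.
top_10_share = sum(sorted_counts[:10]) / len(tokens)
print(f"share of tokens covered by the 10 most frequent classes: {top_10_share:.2f}")
```

Under a 1/rank law, the top-10 share is roughly the ratio of the 10th to the 1000th harmonic number, i.e. a small fraction of the classes accounts for a large fraction of the data, which is the imbalance the paper identifies as a key factor in Adam's advantage.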