Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models

May-26-2025, 21:04:00 GMT–Neural Information Processing Systems

Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks. When trained with gradient descent, the loss on infrequent words decreases more slowly than the loss on frequent ones. Because most samples come from infrequent words, this leads to a slow decrease in the average loss. Adam and sign-based methods, on the other hand, are less sensitive to this problem.
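
The mechanism described in the abstract can be illustrated with a small toy sketch. This is not the paper's experimental setup. The one-hot inputs, the 1/k class frequencies, the hand-picked step sizes, and the helper names (per_class_losses, gradient, train) are all assumptions made for illustration, and plain sign descent stands in for Adam and other sign-based methods.

```python
import numpy as np


def per_class_losses(W):
    """Cross-entropy of each class c evaluated on its own one-hot input."""
    # Column c of W holds the logits produced by the one-hot input of class c.
    logits = W - W.max(axis=0, keepdims=True)  # stabilise the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -np.diag(log_probs)


def gradient(W, freqs):
    """Gradient of the frequency-weighted average loss with respect to W."""
    logits = W - W.max(axis=0, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=0, keepdims=True)
    # The gradient for column c is scaled by the frequency of class c.
    return (probs - np.eye(W.shape[0])) * freqs[np.newaxis, :]


def train(update, num_classes=50, steps=200):
    """Run full-batch training with the given update rule (assumed toy setup)."""
    # Assumption: heavy-tailed class frequencies proportional to 1/k.
    freqs = 1.0 / np.arange(1, num_classes + 1)
    freqs /= freqs.sum()
    W = np.zeros((num_classes, num_classes))
    for _ in range(steps):
        W = W - update(gradient(W, freqs))
    losses = per_class_losses(W)
    return losses, float(freqs @ losses)


# Gradient descent: the step is proportional to the raw gradient.
gd_losses, gd_avg = train(lambda g: 1.0 * g)
# Sign descent: only the sign of each gradient entry is used.
sign_losses, sign_avg = train(lambda g: 0.05 * np.sign(g))

print("GD   loss, most frequent vs rarest class:", gd_losses[0], gd_losses[-1])
print("Sign loss, most frequent vs rarest class:", sign_losses[0], sign_losses[-1])
print("Average loss, GD vs sign descent:", gd_avg, sign_avg)
```

In this toy model the gradient for each class is multiplied by that class's frequency, so gradient descent moves the parameters of rare classes far more slowly than those of frequent ones. Sign descent discards that scale and makes the same progress on every class. The step sizes are not tuned, so the comparison shows the effect qualitatively and is not a benchmark.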

adam outperform gradient descent, artificial intelligence, machine learning

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.95)