Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Neural Information Processing Systems
Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks. When trained with gradient descent, the loss on infrequent words decreases more slowly than the loss on frequent ones. Because most samples come from infrequent words, this leads to a slow decrease in the average loss. Adam and sign-based methods, on the other hand, are less sensitive to this problem.
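The mechanism can be illustrated with a small toy problem. The sketch below is not the paper's experiment: it assumes a hypothetical setup in which each class has its own row of logits, class frequencies follow a Zipf-like 1/k law, and plain sign descent stands in for Adam. All names (`per_class_losses`, `grad`, `train`) and the step sizes are illustrative choices, and the two step sizes are not tuned against each other. Under gradient descent the gradient for class k is scaled by its frequency, so rare classes move slowly; sign descent discards that scale.

```python
import numpy as np

def softmax_rows(Z):
    # Numerically stable softmax over each row of logits.
    Z = Z - Z.max(axis=1, keepdims=True)
    P = np.exp(Z)
    return P / P.sum(axis=1, keepdims=True)

def per_class_losses(Z):
    # Row k holds the logits for inputs of class k; the correct label is k.
    return -np.log(np.diag(softmax_rows(Z)))

def grad(Z, freqs):
    # Gradient of the frequency-weighted cross-entropy w.r.t. the logits.
    # Row k is scaled by freqs[k], so rare classes get small gradients.
    return freqs[:, None] * (softmax_rows(Z) - np.eye(len(freqs)))

def train(update, freqs, steps, lr):
    # `update` maps the raw gradient to the step direction.
    Z = np.zeros((len(freqs), len(freqs)))
    for _ in range(steps):
        Z -= lr * update(grad(Z, freqs))
    return per_class_losses(Z)

# Assumed heavy-tailed class frequencies: pi_k proportional to 1/k.
num_classes = 100
freqs = 1.0 / np.arange(1, num_classes + 1)
freqs /= freqs.sum()

# Gradient descent uses the gradient as-is; sign descent keeps only its sign.
gd_losses = train(lambda g: g, freqs, steps=200, lr=1.0)
sign_losses = train(np.sign, freqs, steps=200, lr=0.05)

print("GD   loss, most frequent vs rarest class:", gd_losses[0], gd_losses[-1])
print("Sign loss, most frequent vs rarest class:", sign_losses[0], sign_losses[-1])
print("GD   average loss:", freqs @ gd_losses)
print("Sign average loss:", freqs @ sign_losses)
```

In this toy, gradient descent drives the loss of the most frequent class down quickly while the rarest class stays close to its initial loss, whereas sign descent makes the same progress on every class regardless of frequency.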
May-26-2025, 21:04:00 GMT