Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models

Neural Information Processing Systems 

Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks.
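To make the phenomenon concrete: "heavy-tailed class imbalance" refers to the fact that next-token prediction treats each vocabulary item as a class, and token frequencies in natural language are roughly Zipfian, so a few classes dominate while most are rare. The following is a minimal sketch (not code from the paper) that simulates Zipf-distributed token ids with NumPy and measures how much of the data the most frequent classes cover; the vocabulary size and Zipf exponent are illustrative assumptions.

```python
import numpy as np

# Illustrative only: language token frequencies are heavy-tailed
# (roughly Zipfian), so a handful of classes account for most tokens.
rng = np.random.default_rng(0)
vocab_size = 10_000                                # assumed toy vocabulary size
tokens = rng.zipf(a=1.2, size=1_000_000)           # Zipf-distributed "token ids"
tokens = tokens[tokens <= vocab_size]              # truncate to the finite vocabulary

# Per-class counts, sorted from most to least frequent, as fractions.
counts = np.bincount(tokens, minlength=vocab_size + 1)[1:]
freqs = np.sort(counts)[::-1] / counts.sum()

top_1pct = int(0.01 * vocab_size)
print(f"top 1% of classes cover {freqs[:top_1pct].sum():.1%} of tokens")
```

Running this shows the top 1% of classes covering well over half of all tokens, the imbalance pattern the abstract identifies as a key factor behind Adam's advantage over gradient descent.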
