Why Do We Need Warm-up? A Theoretical Perspective
Alimisis, Foivos, Islamov, Rustem, Lucchi, Aurelien
Training modern machine learning models requires a careful choice of hyperparameters. A common practice for setting the learning rate (LR) is to linearly increase the LR at the beginning of training (warm-up stage) [Goyal et al., 2017, Vaswani et al., 2017] and gradually decrease it toward the end (decay stage) [Loshchilov and Hutter, 2016, Vaswani et al., 2017, Hoffmann et al., 2022b, Zhang et al., 2023, Dremov et al., 2025]. Decaying the LR is a classical requirement in the theoretical analysis of SGD, ensuring convergence under broad conditions [Defazio et al., 2023, Gower et al., 2021], and it has been consistently observed to improve empirical performance [Loshchilov and Hutter, 2016, Hu et al., 2024, Hägele et al., 2024]. Recent work further demonstrates that decaying step sizes can improve theoretical guarantees by yielding tighter bounds [Schaipp et al., 2025]. By contrast, the practice of linearly increasing the LR at the start of training (warm-up phase) has become nearly ubiquitous in modern deep learning [He et al., 2016, Hu et al., 2024, Hägele et al., 2024], yet a clear theoretical understanding of why it helps optimization remains elusive. This raises the central question we address in this paper: Why does LR warm-up improve training, and under what conditions can its benefits be theoretically justified?