Why Do We Need Warm-up? A Theoretical Perspective

Foivos Alimisis, Rustem Islamov, Aurelien Lucchi

arXiv.org (Machine Learning)

Training modern machine learning models requires a careful choice of hyperparameters. A common practice for setting the learning rate (LR) is to linearly increase the LR at the beginning of training (warm-up stage) [Goyal et al., 2017, Vaswani et al., 2017] and gradually decrease it at the end of training (decay stage) [Loshchilov and Hutter, 2016, Vaswani et al., 2017, Hoffmann et al., 2022b, Zhang et al., 2023, Dremov et al., 2025]. Decaying the LR is a classical requirement in the theoretical analysis of SGD, ensuring convergence under broad conditions [Defazio et al., 2023, Gower et al., 2021], and it has been consistently observed to improve empirical performance [Loshchilov and Hutter, 2016, Hu et al., 2024, Hägele et al., 2024]. Recent work further demonstrates that decaying step sizes can improve theoretical guarantees by yielding tighter bounds [Schaipp et al., 2025]. By contrast, the practice of linearly increasing the LR at the start of training (warm-up phase) has become nearly ubiquitous in modern deep learning [He et al., 2016, Hu et al., 2024, Hägele et al., 2024], yet a clear theoretical understanding of why it helps optimization remains elusive. This raises the central question we address in this paper: Why does LR warm-up improve training, and under what conditions can its benefits be theoretically justified?
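To make the schedule described above concrete, here is a minimal sketch of a warm-up-plus-decay LR schedule. The linear warm-up followed by a decay stage is taken from the description in the abstract; the specific peak LR, warm-up length, and the choice of cosine decay are illustrative assumptions, not prescriptions from the paper.

```python
import math

def lr_schedule(step, total_steps, peak_lr=1e-3, warmup_steps=1000):
    """Illustrative linear warm-up followed by cosine decay.

    peak_lr, warmup_steps, and the cosine form are assumed here for
    illustration; the paper does not fix these choices.
    """
    if step < warmup_steps:
        # Warm-up stage: LR increases linearly from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay stage: LR decreases smoothly from peak_lr toward 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Example usage: inspect the LR at a few points of a 10,000-step run.
for s in (0, 500, 1000, 5000, 9999):
    print(s, lr_schedule(s, total_steps=10_000))
```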