[R1/R2] Infinite-width assumption: the infinite-width assumption is needed due to the technical detail that the norm

Neural Information Processing Systems

We thank the reviewers for their valuable comments; we respond to the main concerns below. Similar to Zhang et al. [31], we chose a 10k-block ResNet to stress the … We will rephrase L243 to better express this. The derivative of the weights depends on this term due to the chain rule; we will make this explicit in the revised manuscript.


Why Warmup the Learning Rate? Underlying Mechanisms and Improvements

Neural Information Processing Systems

In modern deep learning, it is common to warm up the learning rate $\eta$, often by a linear schedule between $\eta_{\text{init}} = 0$ and a predetermined target $\eta_{\text{trgt}}$. In this paper, we show through systematic experiments with SGD and Adam that the overwhelming benefit of warmup arises from allowing the network to tolerate larger $\eta_{\text{trgt}}$ by forcing the network into better-conditioned regions of the loss landscape. The ability to handle larger target learning rates in turn makes hyperparameter tuning more robust while improving the final performance of the network. We uncover different regimes of operation during the warmup period, depending on whether the network training starts off in a progressive sharpening or a sharpness reduction phase, which in turn depends on the initialization and parameterization. Using these insights, we show how $\eta_{\text{init}}$ can be properly chosen by utilizing the loss catapult mechanism, which saves on the number of warmup steps, in some cases completely eliminating the need for warmup. We also suggest an initialization for the variance in Adam, which provides benefits similar to warmup.
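As a concrete illustration of the linear schedule described above, the learning rate can be ramped from $\eta_{\text{init}}$ (here 0) to $\eta_{\text{trgt}}$ over a fixed number of steps and then held constant. This is a minimal sketch; the function and parameter names are illustrative, and the actual schedule lengths and targets vary by experiment:

```python
def linear_warmup_lr(step, warmup_steps, eta_trgt, eta_init=0.0):
    """Learning rate at `step` under a linear warmup schedule.

    Ramps linearly from eta_init to eta_trgt over `warmup_steps` steps,
    then stays at eta_trgt. Names are illustrative, not from the paper.
    """
    if step >= warmup_steps:
        return eta_trgt
    return eta_init + (eta_trgt - eta_init) * step / warmup_steps

# Example: a 5-step warmup to a target learning rate of 0.1 produces
# (up to float rounding) 0.0, 0.02, 0.04, 0.06, 0.08, 0.1, 0.1, ...
schedule = [linear_warmup_lr(s, warmup_steps=5, eta_trgt=0.1) for s in range(7)]
```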


Birder: Communication-Efficient 1-bit Adaptive Optimizer for Practical Distributed DNN Training

Neural Information Processing Systems

Various gradient compression algorithms have been proposed to alleviate the communication bottleneck in distributed learning, and they have demonstrated effectiveness in terms of high compression ratios and theoretically low communication complexity. However, when it comes to practically training modern deep neural networks (DNNs), these algorithms have yet to match the inference performance of uncompressed SGD-momentum (SGDM) and adaptive optimizers (e.g., Adam). More importantly, recent studies suggest that these algorithms actually offer no speed advantage over SGDM/Adam when used with common distributed DNN training frameworks (e.g., DistributedDataParallel (DDP)) in typical settings, due to heavy compression/decompression computation, incompatibility with efficient All-Reduce, or the requirement of an uncompressed warmup in the early stage. For these reasons, we propose a novel 1-bit adaptive optimizer, dubbed *Bi*nary *r*andomization a*d*aptive optimiz*er* (**Birder**). The quantization in Birder can be computed easily and cheaply, and it does not require warmup with its uncompressed version at the beginning. We also devise Hierarchical-1-bit-All-Reduce to further lower the communication volume. We theoretically prove that Birder achieves the same convergence rate as Adam. Extensive experiments, conducted on 8 to 64 GPUs (1 to 8 nodes) using DDP, demonstrate that Birder achieves comparable inference performance to uncompressed SGDM/Adam, with up to ${2.5 \times}$ speedup for training ResNet-50 and ${6.3\times}$ speedup for training BERT-Base. Code is publicly available at https://openi.pcl.ac.cn/c2net_optim/Birder.
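The abstract does not spell out Birder's quantizer, but a generic unbiased stochastic 1-bit (sign) quantizer of the kind used in communication-efficient optimizers can be sketched as follows. This is an assumption-laden illustration, not Birder's actual scheme; all names are hypothetical:

```python
import random

def one_bit_quantize(vec, rng=random):
    """Stochastic 1-bit quantization: each coordinate becomes +scale or -scale.

    With scale = max |v_i| and P(+scale) = (1 + v_i/scale)/2, each coordinate
    is unbiased in expectation. Generic sketch, NOT Birder's exact quantizer.
    """
    scale = max(abs(v) for v in vec) or 1.0  # fall back to 1.0 for the zero vector
    bits = []
    for v in vec:
        p_plus = 0.5 * (1.0 + v / scale)  # probability of the +scale level
        bits.append(1 if rng.random() < p_plus else 0)
    return scale, bits  # 1 bit per coordinate plus one shared float scale

def dequantize(scale, bits):
    """Reconstruct the quantized vector on the receiving side."""
    return [scale if b else -scale for b in bits]
```

Coordinates at the extremes are quantized deterministically: for `[1.0, -1.0]` the bits are always `[1, 0]` and dequantization recovers the vector exactly.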






Fast Certified Robust Training with Short Warmup

Neural Information Processing Systems

Defenses such as adversarial training (Madry et al., 2018) provide no provable robustness guarantees. Both IBP and CROWN-IBP with loss fusion (Xu et al., 2020) have a per-batch training time … For example, generalized CROWN-IBP in Xu et al. (2020) used 900 epochs for warmup and 2,000 … Prior works on certified training generally use weight initialization methods originally designed for standard DNN training (He et al., 2015a), while certified training essentially optimizes a different type of augmented network defined by robustness verification (Zhang et al., 2020). It can, however, hamper classification performance if too many neurons are dead.
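For context on the IBP method mentioned above: interval bound propagation pushes coordinate-wise lower/upper bounds through each layer, splitting the weight matrix by sign so the bounds stay sound. A minimal plain-Python sketch for one linear layer followed by ReLU (illustrative names; real implementations use vectorized tensor operations):

```python
def interval_linear(lo, hi, W, b):
    """Propagate the box [lo, hi] through y = W x + b (standard IBP).

    For each output, positive weights take the matching input bound and
    negative weights take the opposite one, which keeps the bounds sound.
    """
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = bias + sum(w * (lo[j] if w >= 0 else hi[j]) for j, w in enumerate(row))
        u = bias + sum(w * (hi[j] if w >= 0 else lo[j]) for j, w in enumerate(row))
        out_lo.append(l)
        out_hi.append(u)
    return out_lo, out_hi

def interval_relu(lo, hi):
    """ReLU is monotone, so it applies directly to both bounds."""
    return [max(0.0, l) for l in lo], [max(0.0, u) for u in hi]

# Example: x in [0,1]^2 through y = 1.0*x1 - 2.0*x2 + 0.5 gives y in [-1.5, 1.5],
# and ReLU tightens the lower bound to 0.
lo, hi = interval_linear([0.0, 0.0], [1.0, 1.0], [[1.0, -2.0]], [0.5])
rlo, rhi = interval_relu(lo, hi)
```

Dead neurons, as referenced in the excerpt, are those whose upper bound stays at 0 after ReLU, so they carry no signal through the rest of the network.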