Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

Neural Information Processing Systems 

Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits.