The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models

Neural Information Processing Systems 

Recent works have demonstrated great success in pre-training large-scale autoregressive language models (e.g., GPT-3) on massive GPU clusters.