The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models
