SkyLadder: Better and Faster Pretraining via Context Window Scheduling
Zhu, Tongyao, Liu, Qian, Wang, Haonan, Chen, Shiqi, Gu, Xiangming, Pang, Tianyu, Kan, Min-Yen
Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance while matching or exceeding baseline results on long-context tasks. Through extensive experiments, we pretrain 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines.

The evolution of language models has been marked by a consistent expansion in context window sizes (Figure 1, left). While early models like GPT (Radford, 2018) and BERT (Kenton & Toutanova, 2019) were limited to context windows of 512 tokens, subsequent models have pushed these boundaries significantly. GPT-2 (Radford et al., 2019) doubled this capacity to 1024 tokens, and with the advent of Large Language Models (LLMs) exceeding 1B parameters, the progression continued: Llama (Touvron et al., 2023a) used a 2048-token window, Llama-2 (Touvron et al., 2023b) extended it to 4096, and Llama-3 (Dubey et al., 2024) further expanded it to 8192 tokens. The push to expand the context window is motivated by the need for models to handle longer sequences during inference. It is also driven by a widespread belief that models pretrained with longer context windows should perform comparably to, or even surpass, their shorter-context counterparts, since extended windows reduce document truncation and preserve coherence (Ding et al., 2024). We question whether larger context windows during pretraining actually improve performance.
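To make the short-to-long transition concrete, below is a minimal sketch of a context window scheduler. It assumes a simple linear ramp from a small to a large window over a fraction of training; the function name, hyperparameters, and schedule shape are illustrative assumptions and may differ from the actual SkyLadder schedule described in the paper.

```python
def context_window_at_step(step: int,
                           total_steps: int,
                           min_ctx: int = 512,
                           max_ctx: int = 8192,
                           ramp_fraction: float = 0.8) -> int:
    """Hypothetical short-to-long schedule (not necessarily SkyLadder's exact rule).

    The context window grows linearly from `min_ctx` to `max_ctx` over the
    first `ramp_fraction` of training steps, then stays at `max_ctx`.
    """
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    progress = min(1.0, step / ramp_steps)
    ctx = min_ctx + progress * (max_ctx - min_ctx)
    # Snap to a multiple of min_ctx so a fixed per-batch token budget
    # packs evenly into sequences of the current length.
    return min(max_ctx, int(round(ctx / min_ctx)) * min_ctx)


# Example: query the schedule at a few points in a 100K-step run.
for s in (0, 40_000, 80_000, 100_000):
    print(s, context_window_at_step(s, total_steps=100_000))
```

Under a fixed token budget per batch, a shorter window early in training means more (shorter) sequences per batch, which is where the reported speedup comes from; the window only reaches its full length late in training.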
arXiv.org Artificial Intelligence
Mar-19-2025