A Survey on LLM Mid-Training

Tu, Chengying, Zhang, Xuemiao, Weng, Rongxiang, Li, Rumei, Zhang, Chen, Bai, Yang, Yan, Hongfei, Wang, Jingang, Cai, Xunliang

arXiv.org Artificial Intelligence 

Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing targeted capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.

Foundation model development has shifted from monolithic pre-training approaches to sophisticated multi-stage optimization frameworks (Ibrahim et al., 2024; Blakeney et al., 2024; Feng et al., 2024; Zhang et al., 2025a;b). While general pre-training establishes fundamental competencies through exposure to diverse large-scale corpora, contemporary research demonstrates that subsequent optimization phases systematically amplify specialized capabilities such as mathematics, reasoning, coding, agentic tasks, and long-context extension (Grattafiori et al., 2024; Parmar et al., 2024; OLMo et al., 2025). This evolution reflects a growing consensus that general pre-training alone may not effectively or sufficiently cultivate the capabilities required in specialized domains, particularly those that demand sustained access to high-quality data sources.
The demonstrated potential of intermediate optimization phases has led to their formalization as a distinct developmental stage, now increasingly recognized as the mid-training stage. Mid-training is positioned as the critical bridge between general pre-training and post-training, characterized by intermediate computational demands and targeted large-scale data utilization. It has proven able to balance capabilities in both directions: looking forward, it builds the potential for specialized capabilities through curriculum-guided exposure to domain-specific data; looking backward, it preserves general competencies by reserving a fixed share of general data in the training mixture. Whereas pre-training focuses on establishing foundational capabilities, mid-training aims to preserve these foundations while amplifying targeted competencies.
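
As a rough illustration of the balance described above, the following sketch shows one way a data-mixture schedule could combine a curriculum over domain-specific data with a reserved general-data share. It is a minimal sketch under stated assumptions, not a method taken from the surveyed works: the function name `mixture_at`, the linear ramp, and the default values `general_floor=0.3` and `domain_start=0.1` are all hypothetical choices made for illustration.

```python
def mixture_at(step: int, total_steps: int,
               general_floor: float = 0.3,
               domain_start: float = 0.1) -> dict[str, float]:
    """Return sampling weights for general vs. domain data at a given step.

    Hypothetical schedule: the domain share ramps up linearly over training
    (a simple curriculum), but is capped so that the general share never
    drops below `general_floor` (the reserved general data ratio).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= general_floor <= 1.0:
        raise ValueError("general_floor must lie in [0, 1]")
    if not 0.0 <= domain_start <= 1.0:
        raise ValueError("domain_start must lie in [0, 1]")

    # Largest domain share that still leaves the reserved general ratio.
    domain_max = 1.0 - general_floor
    # Fraction of mid-training completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Linear ramp from the starting domain share to the cap.
    domain = domain_start + (domain_max - domain_start) * progress
    domain = min(domain, domain_max)
    return {"general": 1.0 - domain, "domain": domain}


if __name__ == "__main__":
    # Print the mixture at the start, midpoint, and end of a 1000-step run.
    for s in (0, 500, 1000):
        w = mixture_at(s, 1000)
        print(s, round(w["general"], 3), round(w["domain"], 3))
```

A data loader could consult these weights at each step to decide how many examples to draw from each source. The linear ramp is only one possible curriculum; the point of the sketch is the cap, which keeps general data in the mixture throughout.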