A Survey on LLM Mid-Training

Tu, Chengying, Zhang, Xuemiao, Weng, Rongxiang, Li, Rumei, Zhang, Chen, Bai, Yang, Yan, Hongfei, Wang, Jingang, Cai, Xunliang

Nov-5-2025, arXiv.org Artificial Intelligence

Recent advances in foundation models have highlighted the benefits of multi-stage training, and in particular the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate levels of data and computational resources, systematically enhancing targeted capabilities such as mathematics, coding, reasoning, and long-context extension while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights to support future research and innovation in LLMs.

Foundation model development has shifted from monolithic pre-training to sophisticated multi-stage optimization frameworks (Ibrahim et al., 2024; Blakeney et al., 2024; Feng et al., 2024; Zhang et al., 2025a;b). While general pre-training establishes fundamental competencies through exposure to diverse large-scale corpora, contemporary research demonstrates that subsequent optimization phases systematically amplify specialized capabilities such as mathematics, reasoning, coding, agent tasks, and long-context extension (Grattafiori et al., 2024; Parmar et al., 2024; OLMo et al., 2025). This evolution reflects a growing consensus that general pre-training alone may not effectively or sufficiently cultivate the capabilities required in specialized domains, particularly those that demand sustained access to high-quality data sources.

The demonstrated potential of these intermediate optimization phases has led to their formalization as a distinct developmental stage, increasingly recognized as the mid-training stage. Mid-training is positioned as the critical bridge between general pre-training and post-training, characterized by intermediate computational demands and targeted large-scale data utilization. It balances capabilities in two directions: it builds specialized capabilities through curriculum-guided exposure to domain-specific data, and it preserves general competencies by reserving a fixed share of general data in the training mix. While pre-training focuses on establishing foundational capabilities, mid-training aims to preserve these foundations while amplifying targeted competencies.
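
The idea of curriculum-guided exposure combined with a reserved general data ratio can be illustrated with a small sampling sketch. This is not taken from the survey: the linear ramp, the 0.3 floor, and the names `domain_ratio` and `sample_batch` are all assumptions chosen for illustration, and real mid-training recipes may schedule their mixtures quite differently.

```python
import random


def domain_ratio(step, total_steps, general_floor=0.3):
    """Share of domain-specific samples at a given training step.

    Hypothetical schedule: ramps linearly from 0 up to (1 - general_floor),
    so the general-data share never drops below general_floor.
    """
    if total_steps <= 1:
        return 1.0 - general_floor
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return progress * (1.0 - general_floor)


def sample_batch(general, domain, step, total_steps, batch_size, rng,
                 general_floor=0.3):
    """Draw one batch, choosing each sample's source by the current ratio."""
    p_domain = domain_ratio(step, total_steps, general_floor)
    batch = []
    for _ in range(batch_size):
        pool = domain if rng.random() < p_domain else general
        batch.append(rng.choice(pool))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    general_docs = ["general_%d" % i for i in range(5)]  # placeholder data
    domain_docs = ["math_%d" % i for i in range(5)]      # placeholder data
    for step in (0, 50, 99):
        batch = sample_batch(general_docs, domain_docs, step, 100, 8, rng)
        print(step, round(domain_ratio(step, 100), 2), batch)
```

Under these assumptions, early batches are drawn entirely from general data, and the domain share grows over time but is capped so that at least 30% of each batch is still expected to come from the general pool.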

large language model, machine learning, natural language

Country:
- Asia (0.93)
- Europe > Austria (0.28)
- North America > Mexico (0.28)

Genre:
- Overview (0.68)
- Research Report (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)