Large-Scale Diverse Synthesis for Mid-Training
Zhang, Xuemiao, Tu, Chengying, Ren, Can, Weng, Rongxiang, Yan, Hongfei, Wang, Jingang, Cai, Xunliang
arXiv.org Artificial Intelligence
The scarcity of high-quality, knowledge-intensive training data hinders the development of large language models (LLMs), as traditional corpora provide limited information. Previous studies have synthesized and integrated corpora-dependent question-answering (QA) data to improve model performance but face challenges in QA data scalability and knowledge diversity, particularly in cross-domain contexts. Furthermore, leveraging our designed discipline and difficulty annotation system, we probe model deficiencies in STEM disciplines and high-difficulty data. To overcome these limitations, we propose a novel diversified pipeline to synthesize BoostQA, a 100B-token large-scale QA dataset. Our synthesis framework: (1) curates seed data from heterogeneous sources; (2) utilizes DeepSeek-R1 to implement STEM-focused multi-grade synthesis to boost data diversity and high-difficulty synthesis to mitigate difficulty degradation; (3) refines answers via DeepSeek-V3 to improve output quality. We utilize BoostQA in mid-training, a mid-stage between pre-training and post-training, to optimize domain-specific knowledge acquisition and enhance data quality. Our method enables Llama-3 8B, mid-trained on a 40B-token dataset, to achieve an average improvement of 12.74% on MMLU and CMMLU and establish SOTA average performance across 12 benchmarks. BoostQA also demonstrates robust scalability, with performance consistently improving as model size, data volume, and initial FLOPs scale.
Aug-5-2025
- Country:
- Asia > Middle East
- Jordan (0.04)
- Europe
- North America > United States (0.14)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Banking & Finance > Loans
- Mortgages (0.46)
- Education > Educational Setting (0.93)
- Health & Medicine > Therapeutic Area
- Information Technology (0.67)
- Technology: