YuLan-Mini: An Open Data-efficient Language Model
Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen
arXiv.org Artificial Intelligence
Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data scheduling strategies, a robust optimization method that mitigates training instability, and an effective annealing approach that incorporates targeted data selection and long-context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase.
Dec-24-2024
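
The abstract highlights an annealing stage at the end of pre-training. As a rough illustration of what such a stage can look like in practice, the sketch below implements a warmup-stable-decay style learning-rate schedule with a final annealing phase. The function name, phase fractions, peak and minimum rates, and the 1 − sqrt decay shape are illustrative assumptions, not the paper's exact settings.

```python
import math

def wsd_lr(step, total_steps, peak_lr=1e-3, min_lr=1e-5,
           warmup_frac=0.01, anneal_frac=0.10):
    """Illustrative warmup-stable-decay schedule: linear warmup,
    constant plateau, then an annealing phase decaying to min_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    anneal_steps = int(total_steps * anneal_frac)
    anneal_start = total_steps - anneal_steps

    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    if step < anneal_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Annealing phase: (1 - sqrt) decay from peak_lr down to min_lr.
    progress = (step - anneal_start) / max(1, anneal_steps)
    return min_lr + (peak_lr - min_lr) * (1.0 - math.sqrt(min(progress, 1.0)))

if __name__ == "__main__":
    total = 100_000
    for s in (0, 500, 50_000, 95_000, 100_000):
        print(f"step {s:>7}: lr = {wsd_lr(s, total):.2e}")
```

In such schedules, the annealing window is also where curated, higher-quality data and longer-context samples are typically introduced, which matches the targeted data selection and long-context training the abstract describes.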