Robust LLM Training Infrastructure at ByteDance
Wan, Borui; Liu, Gaohong; Song, Zuquan; Wang, Jun; Zhang, Yun; Sheng, Guangming; Wang, Shuguang; Wei, Houmin; Wang, Chenyuan; Lou, Weiqiang; Yang, Xi; Zhang, Mofan; Jiang, Kaihua; Ren, Cheng; Zhi, Xiaoyun; Yu, Menghan; Nan, Zhe; Zheng, Zhuolin; Zhong, Baoquan; Wang, Qinlong; Yu, Huan; Chi, Jinxin; Zhang, Wang; Li, Yuhan; Du, Zixian; Zhao, Sida; Zhang, Yongqiang; Tang, Jingzhe; Liu, Zherui; Wu, Chuan; Peng, Yanghua; Lin, Haibin; Xiao, Wencong; Liu, Xin; Xiang, Liang
arXiv.org Artificial Intelligence
Large language model (LLM) training now runs on tens of thousands of GPUs, and the scale is still growing, enabling faster training of larger models. As resource scale expands, failures (CUDA errors, NaN values, job hangs, etc.) become prevalent and pose significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the unique characteristics of the LLM training process and gives top priority to detecting failures and recovering from them in a routine manner. Leveraging the parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance and prompt fault demarcation and localization through an effective data-driven approach, ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform and achieves a 97% effective training time ratio (ETTR) for a three-month training job on 9,600 GPUs.
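The abstract reports ETTR without spelling out how it is computed. The sketch below is a minimal illustration, assuming ETTR is productive training time divided by the job's total wall-clock time. The breakdown of lost time into detection, recovery, and recomputation, the names `Interruption` and `ettr`, and all numbers are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Interruption:
    """One training interruption (hypothetical breakdown, in seconds)."""
    detect_s: float     # time from failure to detection
    recover_s: float    # time to diagnose, replace machines, and restart
    recompute_s: float  # progress redone since the last checkpoint

    @property
    def lost_s(self) -> float:
        return self.detect_s + self.recover_s + self.recompute_s


def ettr(wall_clock_s: float, interruptions: Iterable[Interruption]) -> float:
    """Assumed definition: (wall-clock time - lost time) / wall-clock time."""
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    lost = sum(i.lost_s for i in interruptions)
    if lost > wall_clock_s:
        raise ValueError("lost time exceeds wall-clock time")
    return (wall_clock_s - lost) / wall_clock_s


# Illustrative numbers only: a 100-hour window with two interruptions
# that each cost 1.5 hours in total.
events = [
    Interruption(detect_s=600, recover_s=1800, recompute_s=3000),
    Interruption(detect_s=600, recover_s=1800, recompute_s=3000),
]
ratio = ettr(100 * 3600, events)
print(f"ETTR = {ratio:.2%}")
```

Under this assumed definition, each term an infrastructure can shorten (faster detection, faster restart, more frequent checkpoints) raises the ratio directly, which is consistent with the abstract's emphasis on routine detection and recovery.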
Oct-21-2025