
Appendix of A Deep Learning Dataloader with Shared Data Preparation

Neural Information Processing Systems (NeurIPS 2022)

In this part, we show the I/O speed in the synchronous and asynchronous cases. Figure 3a shows the I/O speed for four jobs that start at different moments. We then compare RefCnt with the generic cache policies in the above cases. D = sample([0, 13333], 10000) means sampling a subset D of size 10000 from [0, 13333] uniformly at random. DSA always achieves the minimum number of cache misses.
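For concreteness, the sampling step D = sample([0, 13333], 10000) can be sketched in Python. This is a minimal illustration of uniform sampling without replacement, not the paper's implementation; the `sample` helper name is ours:

```python
import random

def sample(id_range, k):
    """Uniformly sample a subset of size k, without replacement,
    from the inclusive integer range id_range = [lo, hi]."""
    lo, hi = id_range
    return random.sample(range(lo, hi + 1), k)

# D = sample([0, 13333], 10000): a random 10000-element subset of {0, ..., 13333}
D = sample([0, 13333], 10000)
```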


A Deep Learning Dataloader with Shared Data Preparation

Neural Information Processing Systems

Prior work tries to improve sampling locality by enforcing all training jobs to load the same dataset in the same order and at the same pace. However, such a solution is only efficient under strong constraints: all jobs are trained on the same dataset with the same starting moment and training speed. In this paper, we propose a new data loading method for efficiently training parallel DNNs under much more flexible constraints. Our method remains highly efficient when different training jobs use different but overlapping datasets and have different starting moments and training speeds.


A Deep Learning Dataloader with Shared Data Preparation

Neural Information Processing Systems

Executing a family of Deep Neural Network (DNN) training jobs on the same or similar datasets in parallel is typical in current deep learning scenarios. It is time-consuming and resource-intensive because each job repetitively prepares (i.e., loads and preprocesses) the data independently, causing redundant I/O and computation. Although the page cache or a centralized cache component can alleviate the redundancy by reusing data preparation work, each job samples its data uniformly at random, which yields low sampling locality in the shared dataset and causes heavy cache thrashing. Prior work tries to solve the problem by enforcing all training jobs to iterate over the dataset in the same order and request each datum in lockstep, leading to strong constraints: all jobs must use the same dataset and run simultaneously. In this paper, we propose a dependent sampling algorithm (DSA) and a domain-specific cache policy to relax these constraints. Besides, a novel tree data structure is designed to implement DSA efficiently. Based on the proposed techniques, we implemented a prototype system, named Joader, which can share data preparation work as long as the datasets partially overlap. We evaluate Joader in practical scenarios, showing greater versatility and superior training speed (up to a 500% improvement on ResNet18).
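As a rough illustration of a reference-count-style cache policy of the kind the paper's appendix calls RefCnt, the sketch below tracks, per cached sample, how many pending jobs still need it, and evicts the sample with the fewest remaining references. This is a hypothetical sketch; `RefCntCache` and its methods are our own names, not Joader's actual API:

```python
class RefCntCache:
    """Minimal sketch of a reference-count cache policy: each cached
    sample carries a count of how many jobs still need it; the entry
    with the fewest remaining references is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}  # sample_id -> (data, refcnt)

    def put(self, sample_id, data, refcnt):
        if len(self.store) >= self.capacity:
            # Evict the sample that the fewest jobs still wait for.
            victim = min(self.store, key=lambda k: self.store[k][1])
            del self.store[victim]
        self.store[sample_id] = (data, refcnt)

    def get(self, sample_id):
        if sample_id not in self.store:
            return None  # cache miss
        data, refcnt = self.store[sample_id]
        if refcnt <= 1:
            del self.store[sample_id]  # no remaining job needs it
        else:
            self.store[sample_id] = (data, refcnt - 1)
        return data
```

The key design point over LRU is that eviction is driven by how many jobs will still request a sample, not by recency, which is what makes sharing across jobs with overlapping datasets effective.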


Preparation Meets Opportunity: Enhancing Data Preprocessing for ML Training With Seneca

Desai, Omkar, Jiao, Ziyang, Pei, Shuyi, Bhimani, Janki, Kim, Bryan S.

arXiv.org Artificial Intelligence

Input data preprocessing is a common bottleneck when concurrently training multimedia machine learning (ML) models in modern systems. To alleviate these bottlenecks and reduce the training time for concurrent jobs, we present Seneca, a data loading system that optimizes cache partitioning and data sampling for the data storage and ingestion (DSI) pipeline. The design of Seneca contains two key techniques. First, Seneca uses a performance model for the data pipeline to optimally partition the cache for three different forms of data (encoded, decoded, and augmented). Second, Seneca opportunistically serves cached data over uncached ones during random batch sampling so that concurrent jobs benefit from each other. We implement Seneca by modifying PyTorch and demonstrate its effectiveness by comparing it against several state-of-the-art caching systems for DNN training. Seneca reduces the makespan by 45.23% compared to PyTorch and increases data processing throughput by up to 3.45x compared to the next best dataloader.
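The cached-data-first sampling idea can be sketched as follows. This is illustrative only, under the assumption that a batch is drawn preferentially from cached ids; the function and parameter names are ours, not Seneca's actual API:

```python
import random

def opportunistic_batch(dataset_ids, cached_ids, batch_size):
    """Prefer sample ids that are already cached, then fill the
    remainder of the batch from uncached ids, each uniformly at random."""
    cached = list(set(dataset_ids) & set(cached_ids))
    uncached = list(set(dataset_ids) - set(cached_ids))
    take_cached = min(batch_size, len(cached))
    batch = random.sample(cached, take_cached)
    batch += random.sample(uncached, batch_size - take_cached)
    random.shuffle(batch)  # avoid ordering bias within the batch
    return batch
```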


Robust LLM Training Infrastructure at ByteDance

Wan, Borui, Liu, Gaohong, Song, Zuquan, Wang, Jun, Zhang, Yun, Sheng, Guangming, Wang, Shuguang, Wei, Houmin, Wang, Chenyuan, Lou, Weiqiang, Yang, Xi, Zhang, Mofan, Jiang, Kaihua, Ren, Cheng, Zhi, Xiaoyun, Yu, Menghan, Nan, Zhe, Zheng, Zhuolin, Zhong, Baoquan, Wang, Qinlong, Yu, Huan, Chi, Jinxin, Zhang, Wang, Li, Yuhan, Du, Zixian, Zhao, Sida, Zhang, Yongqiang, Tang, Jingzhe, Liu, Zherui, Wu, Chuan, Peng, Yanghua, Lin, Haibin, Xiao, Wencong, Liu, Xin, Xiang, Liang

arXiv.org Artificial Intelligence

The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still expanding, enabling faster learning of larger models. Accompanying this growth in resource scale is the prevalence of failures (CUDA errors, NaN values, job hangs, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. Leveraging the parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform and achieves a 97% ETTR for a three-month training job on 9,600 GPUs.


Power Stabilization for AI Training Datacenters

Choukse, Esha, Warrier, Brijesh, Heath, Scot, Belmont, Luz, Zhao, April, Khan, Hassan Ali, Harry, Brian, Kappel, Matthew, Hewett, Russell J., Datta, Kushal, Pei, Yu, Lichtenberger, Caroline, Siegler, John, Lukofsky, David, Kahn, Zaid, Sahota, Gurpreet, Sullivan, Andy, Frederick, Charles, Thai, Hien, Naughton, Rebecca, Jurnove, Daniel, Harp, Justin, Carper, Reid, Mahalingam, Nithish, Varkala, Srini, Kumbhare, Alok Gautam, Desai, Satyajit, Ramamurthy, Venkatesh, Gottumukkala, Praneeth, Bhatia, Girish, Wildstone, Kelsey, Olariu, Laurentiu, Incorvaia, Ileana, Wetmore, Alex, Ram, Prabhat, Raghuraman, Melur, Ayna, Mohammed, Kendrick, Mike, Bianchini, Ricardo, Hurst, Aaron, Zamani, Reza, Li, Xin, Petrov, Michael, Oden, Gene, Carmichael, Rory, Li, Tom, Gupta, Apoorv, Patel, Pratikkumar, Dattani, Nilesh, Marwong, Lawrence, Nertney, Rob, Kobayashi, Hirofumi, Liott, Jeff, Enev, Miro, Ramakrishnan, Divya, Buck, Ian, Alben, Jonah

arXiv.org Artificial Intelligence

Large Artificial Intelligence (AI) training workloads spanning several tens of thousands of GPUs present unique power management challenges. These arise from the high variability in power consumption during training. Given the synchronous nature of these jobs, every iteration has a computation-heavy phase, where each GPU works on its local data, and a communication-heavy phase, where all the GPUs synchronize. Because compute-heavy phases require much more power than communication phases, large power swings occur, and their amplitude keeps increasing as training jobs grow in size. An even bigger challenge arises from the frequency spectrum of these power swings: if it aligns with critical frequencies of the utility grid, it can cause physical damage to power grid infrastructure. Therefore, to continue scaling AI training workloads safely, we need to stabilize their power draw. This paper introduces the challenge with production data and explores innovative solutions across the stack: software, GPU hardware, and datacenter infrastructure. We present the pros and cons of each of these approaches and finally present a multi-pronged approach to solving the challenge. The proposed solutions are rigorously tested using a combination of real hardware and Microsoft's in-house cloud power simulator, providing critical insights into their efficacy under real-world conditions.


Understanding Stragglers in Large Model Training Using What-if Analysis

Lin, Jinkun, Jiang, Ziheng, Song, Zuquan, Zhao, Sida, Yu, Menghan, Wang, Zhanghan, Wang, Chenyuan, Shi, Zuocheng, Shi, Xiang, Jia, Wei, Liu, Zherui, Wang, Shuguang, Lin, Haibin, Liu, Xin, Panda, Aurojit, Li, Jinyang

arXiv.org Artificial Intelligence

Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where training can be stalled by a few slow workers. At ByteDance we find that stragglers are not always simply caused by hardware failures but can arise from multiple complex factors. This work presents a comprehensive study of straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis, which simulates the scenario without any stragglers and contrasts it with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes of stragglers?
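The core of a what-if analysis for synchronous training can be sketched very simply: a step finishes only when the slowest worker does, so the straggler cost is the gap between the slowest worker's time and a no-straggler counterfactual. This is a simplified stand-in, not ByteDance's actual simulator; using the median worker time as the counterfactual is our assumption:

```python
def whatif_slowdown(worker_times):
    """Ratio of the actual synchronous step time (max over workers) to a
    no-straggler counterfactual (here approximated by the median worker
    time). A ratio near 1.0 means stragglers cost little."""
    times = sorted(worker_times)
    actual = times[-1]  # the step waits for the slowest worker
    n = len(times)
    if n % 2:
        median = times[n // 2]
    else:
        median = (times[n // 2 - 1] + times[n // 2]) / 2
    return actual / median
```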


Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters

Sultana, Abeda, Pakka, Nabin, Xu, Fei, Yuan, Xu, Chen, Li, Tzeng, Nian-Feng

arXiv.org Artificial Intelligence

Scheduling deep learning (DL) models to train on powerful clusters with accelerators like GPUs and TPUs presently falls short, either lacking fine-grained heterogeneity awareness or leaving resources substantially under-utilized. To fill this gap, we propose a novel task-level heterogeneity-aware scheduler, Hadar, based on an optimization framework that can boost resource utilization. Hadar leverages the performance traits of DL jobs on a heterogeneous DL cluster, characterizes the task-level performance heterogeneity in the optimization problem, and makes scheduling decisions across both spatial and temporal dimensions. It employs a primal-dual framework with a dual subroutine to solve the optimization problem and guide the scheduling design. Our trace-driven simulation with representative DL model training workloads demonstrates that Hadar shortens the total time duration by 1.20x compared with its state-of-the-art heterogeneity-aware counterpart, Gavel. Further, our Hadar scheduler is enhanced to HadarE by forking each job into multiple copies, letting a job train concurrently on heterogeneous GPUs residing on separate available nodes (i.e., machines or servers) to enhance resource utilization. HadarE is evaluated extensively on physical DL clusters for comparison with Hadar and Gavel. With substantial enhancement in cluster resource utilization (by 1.45x), HadarE exhibits considerable speed-ups in DL model training, reducing the total time duration by 50% (or 80%) on an Amazon AWS (or our lab) cluster, while producing trained DL models with consistently better inference quality than those trained by Hadar.