Echo: Simulating Distributed Training At Scale
Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, Hong Xu
arXiv.org Artificial Intelligence
Simulation is becoming increasingly important for managing the massive machine learning (ML) clusters and large-scale distributed training jobs. In this paper, we build Echo to tackle three key challenges in large-scale training simulation: (1) tracing the runtime training workloads at each device in an ex-situ fashion, so we can use a single device to obtain the actual execution graphs of 1K-GPU training; (2) accurately estimating the collective communication without the high overheads of discrete-event based network simulation; and (3) accounting for the interference-induced computation slowdown from overlapping communication and computation kernels.

Simulation is also useful (perhaps more so) to extrapolate beyond what is currently available, which is paramount for strategic decision making such as capacity planning [24, 38] that involves many what-if questions with significant impact. For example, what speed-up can be achieved by scaling the current cluster by a factor of 3, or by increasing the network bandwidth by 2x? This also greatly facilitates the development of new optimizations, which only need to be prototyped on a small scale for the simulator to extrapolate their potential benefits on a large scale quantitatively.
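The abstract names two modeling problems, closed-form collective-communication estimation (challenge 2) and interference from overlapped kernels (challenge 3), without detailing Echo's solutions. As a rough illustration only, the sketch below combines the standard alpha-beta cost model for a ring all-reduce with a scalar slowdown on overlapped compute; every name and constant here (ClusterConfig, overlap_fraction, slowdown=1.15) is a hypothetical assumption, not Echo's actual model or API.

```python
# Minimal sketch (assumptions, not Echo's code): an alpha-beta ring
# all-reduce estimate plus a scalar interference slowdown, of the kind
# a training simulator can use instead of discrete-event simulation.
from dataclasses import dataclass


@dataclass
class ClusterConfig:
    num_gpus: int          # data-parallel world size p
    link_bandwidth: float  # per-GPU bus bandwidth, bytes/s
    link_latency: float    # per-hop latency (alpha), seconds


def ring_allreduce_time(msg_bytes: float, cfg: ClusterConfig) -> float:
    """Alpha-beta cost of a ring all-reduce: 2(p-1) latency hops plus
    2(p-1)/p of the message crossing each GPU's link."""
    p = cfg.num_gpus
    if p <= 1:
        return 0.0
    latency_term = 2 * (p - 1) * cfg.link_latency
    bandwidth_term = 2 * (p - 1) / p * msg_bytes / cfg.link_bandwidth
    return latency_term + bandwidth_term


def step_time(compute_s: float, msg_bytes: float, cfg: ClusterConfig,
              overlap_fraction: float = 0.8, slowdown: float = 1.15) -> float:
    """Estimate one training step: communication hides behind the
    overlappable part of compute, but overlapped compute is dilated by
    `slowdown` to model interference; the rest of comm is exposed."""
    comm_s = ring_allreduce_time(msg_bytes, cfg)
    overlapped = min(comm_s, overlap_fraction * compute_s)
    exposed = comm_s - overlapped
    dilated_compute = compute_s + overlapped * (slowdown - 1.0)
    return dilated_compute + exposed


if __name__ == "__main__":
    # Hypothetical numbers: a 10 GB gradient bucket on 64 GPUs.
    base = ClusterConfig(num_gpus=64, link_bandwidth=25e9, link_latency=5e-6)
    t0 = step_time(compute_s=0.5, msg_bytes=10e9, cfg=base)
    # What-if question from the text: double the network bandwidth.
    fast = ClusterConfig(num_gpus=64, link_bandwidth=50e9, link_latency=5e-6)
    t1 = step_time(compute_s=0.5, msg_bytes=10e9, cfg=fast)
    print(f"baseline: {t0:.3f}s/step, 2x bandwidth: {t1:.3f}s/step, "
          f"speed-up {t0 / t1:.2f}x")
```

An analytical estimate like this runs in microseconds regardless of cluster size, which is why avoiding packet-level discrete-event simulation matters at 1K-GPU scale; the hard part the paper targets is making such estimates accurate.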
Dec-16-2024