Echo: Simulating Distributed Training At Scale
Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, Hong Xu
arXiv.org Artificial Intelligence
Simulation is becoming increasingly important for managing the massive machine learning (ML) clusters and large-scale distributed training jobs. In this paper, we build Echo to tackle three key challenges in large-scale training simulation: (1) tracing the runtime training workloads at each device in an ex-situ fashion, so we can use a single device to obtain the actual execution graphs of 1K-GPU training; (2) accurately estimating the collective communication without the high overheads of discrete-event based network simulation; and (3) accounting for the interference-induced computation slowdown from overlapping communication and computation kernels.

Simulation is also useful (perhaps more so) to extrapolate beyond what is currently available, which is paramount for strategic decision making such as capacity planning [24, 38] that involves many what-if questions with significant impact. For example, what speed-up can be achieved by scaling the current cluster by a factor of 3, or by increasing the network bandwidth by 2x? This also greatly facilitates the development of new optimizations, which only need to be prototyped on a small scale for the simulator to extrapolate their potential benefits on a large scale quantitatively.
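The abstract names two modeling problems, closed-form collective-communication estimation (challenge 2) and interference from overlapped kernels (challenge 3), without detailing Echo's solutions. As a rough illustration only, the sketch below combines the standard alpha-beta cost model for a ring all-reduce with a scalar slowdown on overlapped compute; every name and constant here (ClusterConfig, overlap_fraction, slowdown=1.15) is a hypothetical assumption, not Echo's actual model or API.

```python
# Minimal sketch (assumptions, not Echo's code): an alpha-beta ring
# all-reduce estimate plus a scalar interference slowdown, of the kind
# a training simulator can use instead of discrete-event simulation.
from dataclasses import dataclass


@dataclass
class ClusterConfig:
    num_gpus: int          # data-parallel world size p
    link_bandwidth: float  # per-GPU bus bandwidth, bytes/s
    link_latency: float    # per-hop latency (alpha), seconds


def ring_allreduce_time(msg_bytes: float, cfg: ClusterConfig) -> float:
    """Alpha-beta cost of a ring all-reduce: 2(p-1) latency hops plus
    2(p-1)/p of the message crossing each GPU's link."""
    p = cfg.num_gpus
    if p <= 1:
        return 0.0
    latency_term = 2 * (p - 1) * cfg.link_latency
    bandwidth_term = 2 * (p - 1) / p * msg_bytes / cfg.link_bandwidth
    return latency_term + bandwidth_term


def step_time(compute_s: float, msg_bytes: float, cfg: ClusterConfig,
              overlap_fraction: float = 0.8, slowdown: float = 1.15) -> float:
    """Estimate one training step: communication hides behind the
    overlappable part of compute, but overlapped compute is dilated by
    `slowdown` to model interference; the rest of comm is exposed."""
    comm_s = ring_allreduce_time(msg_bytes, cfg)
    overlapped = min(comm_s, overlap_fraction * compute_s)
    exposed = comm_s - overlapped
    dilated_compute = compute_s + overlapped * (slowdown - 1.0)
    return dilated_compute + exposed


if __name__ == "__main__":
    # Hypothetical numbers: a 10 GB gradient bucket on 64 GPUs.
    base = ClusterConfig(num_gpus=64, link_bandwidth=25e9, link_latency=5e-6)
    t0 = step_time(compute_s=0.5, msg_bytes=10e9, cfg=base)
    # What-if question from the text: double the network bandwidth.
    fast = ClusterConfig(num_gpus=64, link_bandwidth=50e9, link_latency=5e-6)
    t1 = step_time(compute_s=0.5, msg_bytes=10e9, cfg=fast)
    print(f"baseline: {t0:.3f}s/step, 2x bandwidth: {t1:.3f}s/step, "
          f"speed-up {t0 / t1:.2f}x")
```

An analytical estimate like this runs in microseconds regardless of cluster size, which is why avoiding packet-level discrete-event simulation matters at 1K-GPU scale; the hard part the paper targets is making such estimates accurate.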
Dec-16-2024