Expediting Distributed DNN Training with Device Topology-Aware Graph Deployment
Shiwei Zhang, Xiaodong Yi, Lansong Diao, Chuan Wu, Siyu Wang, Wei Lin
–arXiv.org Artificial Intelligence
Abstract--This paper presents TAG, an automatic system that derives an optimized DNN training graph and its deployment onto any device topology, for expedited training in device- and topology-heterogeneous ML clusters. We combine both the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and join the GNN with a search-based method to quickly identify optimized distributed training strategies. To reduce communication in a heterogeneous cluster, we further explore a lossless gradient compression technique and solve a combinatorial optimization problem to apply it automatically for training time minimization. We evaluate TAG with various representative DNN models and device topologies, showing that it achieves up to 4.56x training speed-up compared to existing schemes. TAG produces efficient deployment strategies for both unseen DNN models and unseen device topologies, without heavy fine-tuning.

Deep learning (DL) has powered a wide range of applications in various areas including computer vision [1], [2], natural language processing [3], [4], recommendation systems [5], etc. Recent deep neural network (DNN) models feature a [...] BERT [6] with more than 340M parameters) to achieve superior performance [3], [6]. [...] homogeneous cluster, e.g., training BERT using 8 NVIDIA V100 GPUs [7].

These decisions jointly form an exponentially large strategy space. Current practice often falls back to heuristics that consider one aspect of the strategy space at a time [17], [18], resulting in less efficient or even infeasible solutions. Pioneering works on deploying DNN models onto heterogeneous [...] However, their models do not generalize [...] these models. This makes them impractical for AI clouds, where new resource configurations [...]
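To make the GNN idea from the abstract concrete, here is a minimal sketch, assuming a plain PyTorch setup rather than TAG's actual architecture: one message-passing encoder embeds the computation-graph nodes, another embeds the device-topology nodes, and their pairwise scores could seed a placement search. The class names (MessagePassing, PlacementScorer), feature dimensions, and dense-adjacency formulation are illustrative assumptions, not details from the paper.

# Minimal sketch (not TAG's actual architecture): encode a DNN computation
# graph and a device-topology graph with message-passing GNNs, then score
# operator-to-device placements. All names and dimensions are illustrative.
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """One round of mean-aggregation message passing over a dense adjacency matrix."""
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, x, adj):                  # x: [n, dim], adj: [n, n] (0/1 floats)
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        msg = adj @ x / deg                     # mean of neighbour embeddings
        return torch.relu(self.update(torch.cat([x, msg], dim=-1)))

class PlacementScorer(nn.Module):
    """Embeds both graphs and scores operator-to-device affinity."""
    def __init__(self, op_feat, dev_feat, dim=64, rounds=2):
        super().__init__()
        self.op_in = nn.Linear(op_feat, dim)
        self.dev_in = nn.Linear(dev_feat, dim)
        self.op_layers = nn.ModuleList([MessagePassing(dim) for _ in range(rounds)])
        self.dev_layers = nn.ModuleList([MessagePassing(dim) for _ in range(rounds)])

    def forward(self, op_x, op_adj, dev_x, dev_adj):
        h_op = torch.relu(self.op_in(op_x))     # computation-graph node embeddings
        h_dev = torch.relu(self.dev_in(dev_x))  # device-topology node embeddings
        for op_l, dev_l in zip(self.op_layers, self.dev_layers):
            h_op = op_l(h_op, op_adj)
            h_dev = dev_l(h_dev, dev_adj)
        return h_op @ h_dev.t()                 # [num_ops, num_devices] affinity scores

# Toy usage: 10 operators with 8 features each, 4 devices with 4 features each.
scores = PlacementScorer(op_feat=8, dev_feat=4)(
    torch.randn(10, 8), torch.eye(10), torch.randn(4, 4), torch.ones(4, 4))

A search procedure would then refine the placement suggested by these affinity scores, which is where the paper's GNN-plus-search combination comes in.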
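The gradient-compression decision can likewise be pictured as a cost comparison between sending a gradient raw and compressing it first. TAG formulates this as a combinatorial optimization over the whole training graph; the toy rule below decides greedily per tensor, uses zlib as a stand-in lossless codec, and assumes hypothetical link-latency and compression-throughput numbers.

# Illustrative only (not TAG's algorithm): decide per gradient tensor whether a
# lossless compressor pays off on a given link. The greedy per-tensor rule, the
# zlib codec, and the latency/throughput numbers are stand-in assumptions.
import zlib
import numpy as np

def transfer_time(num_bytes, bandwidth_bps, latency_s=1e-4):
    """Simple latency + size/bandwidth link model."""
    return latency_s + 8.0 * num_bytes / bandwidth_bps

def should_compress(grad, bandwidth_bps, compress_throughput_Bps=2.5e8):
    raw = grad.tobytes()
    packed = zlib.compress(raw, level=1)           # cheap lossless codec
    t_raw = transfer_time(len(raw), bandwidth_bps)
    t_comp = (len(raw) / compress_throughput_Bps   # time spent compressing
              + transfer_time(len(packed), bandwidth_bps))
    return t_comp < t_raw

# A sparse gradient compresses well, so paying the compression cost wins on a
# slow cross-node link but not on a fast intra-node interconnect.
grad = np.zeros(1_000_000, dtype=np.float32)
grad[::100] = np.random.randn(10_000).astype(np.float32)
print(should_compress(grad, bandwidth_bps=1e9))     # 1 Gb/s link: True under these toy numbers
print(should_compress(grad, bandwidth_bps=100e9))   # 100 Gb/s link: False under these toy numbers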
Feb-13-2023