Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Shi, Shaohuai, Zhou, Xianhao, Song, Shutao, Wang, Xingyao, Zhu, Zilin, Huang, Xue, Jiang, Xinan, Zhou, Feihu, Guo, Zhenyu, Xie, Liqiang, Lan, Rui, Ouyang, Xianbin, Zhang, Yan, Wei, Jieqian, Gong, Jing, Lin, Weiliang, Gao, Ping, Meng, Peng, Xu, Xiaomin, Guo, Chenyang, Yang, Bo, Chen, Zhibo, Wu, Yongjian, Chu, Xiaowen

arXiv.org Artificial Intelligence 

Distributed training techniques have been widely deployed for training large-scale deep neural networks (DNNs) on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances, traditional state-of-the-art distributed training systems cannot scale well when training large-scale models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve system scalability, we optimize I/O with a simple yet efficient multi-level data caching mechanism and optimize the update operation with a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer. Finally, we break the DAWNBench record for training ResNet-50 to 93% top-5 accuracy on ImageNet.
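The abstract centers on top-k gradient sparsification, i.e., exchanging only the k largest-magnitude gradient entries instead of the dense tensor. Below is a minimal sketch of that idea in PyTorch; the function names and the `density` parameter are illustrative assumptions, not the authors' actual library API.

```python
# Illustrative sketch of top-k gradient sparsification (not the paper's library).
import torch

def topk_sparsify(grad: torch.Tensor, density: float = 0.01):
    """Keep only the largest-magnitude entries of a gradient tensor.

    Returns the selected (signed) values and their flat indices, which is what
    a sparsified collective would communicate instead of the dense gradient.
    """
    flat = grad.flatten()
    k = max(1, int(flat.numel() * density))
    _, indices = torch.topk(flat.abs(), k)
    values = flat[indices]  # recover signed values at the selected positions
    return values, indices

def desparsify(values: torch.Tensor, indices: torch.Tensor, numel: int) -> torch.Tensor:
    """Rebuild a dense gradient from the sparse (values, indices) pair."""
    dense = torch.zeros(numel, device=values.device, dtype=values.dtype)
    dense[indices] = values
    return dense

# Usage: sparsify a synthetic gradient, then reconstruct the dense tensor a peer would apply.
g = torch.randn(1_000_000)
vals, idx = topk_sparsify(g, density=0.01)   # only ~1% of entries are communicated
g_hat = desparsify(vals, idx, g.numel())
```

In practice, the entries not selected are typically accumulated locally (error feedback) so that small gradients are not permanently discarded; the sketch above omits that step.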
