On the Burstiness of Distributed Machine Learning Traffic

Natchanon Luangsomboon, Fahimeh Fazel, Jörg Liebeherr, Ashkan Sobhani, Shichao Guan, Xingjun Chu

arXiv.org Artificial Intelligence 

Traffic from distributed training of machine learning (ML) models makes up a large and growing fraction of the traffic mix in enterprise data centers. While work on distributed ML abounds, the network traffic generated by distributed ML has received little attention. Using measurements on a testbed network, we investigate the characteristics of the traffic generated by training the ResNet-50 neural network, with an emphasis on its short-term burstiness. For the latter, we propose metrics that quantify traffic burstiness at different time scales. Our analysis reveals that distributed ML traffic exhibits a very high degree of burstiness on short time scales, exceeding a 60:1 peak-to-mean ratio on time intervals as long as 5 ms. We observe that the training software orchestrates transmissions in such a way that burst transmissions from different sources within the same application do not result in congestion and packet losses. An extrapolation of the measurement data to multiple applications underscores the challenges that distributed ML traffic poses for congestion and flow control algorithms.

This paper studies and analyzes the burstiness of traffic from training deep neural network (DNN) models as a root cause of short-lived surges of traffic, known as microbursts, which cause periods of high packet delay and loss in a data center network (DCN) even at low utilization. Since microbursts occur at a time scale of less than a millisecond [1], traditional traffic control methods are not effective at avoiding packet losses in such scenarios. Research on microbursts in DCNs has suggested a range of potential root causes, including the inherent burstiness of application traffic, confluence of traffic flows to a common destination (fan-in, incast), offloading of protocol processing at hosts, and traffic control algorithms, such as packet scheduling and flow control [1]-[10].
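To make the notion of a peak-to-mean ratio at a given time scale concrete, the following Python sketch computes one plausible version of it from a packet trace. It is an illustration only and not necessarily the metric defined in this paper: the function name `peak_to_mean_ratio`, the trace format of `(timestamp_us, size_bytes)` pairs, and the use of fixed, non-overlapping windows aligned to the first packet are all assumptions made for the example.

```python
def peak_to_mean_ratio(packets, window_us):
    """Peak-to-mean ratio of traffic volume at a given time scale.

    packets   : list of (timestamp_us, size_bytes) tuples, integer timestamps
                in microseconds (hypothetical trace format for this sketch)
    window_us : window length in microseconds, e.g. 5000 for 5 ms

    The trace is cut into fixed, non-overlapping windows starting at the
    first packet. The result is the largest per-window byte count divided
    by the average per-window byte count over the whole trace.
    """
    if not packets:
        raise ValueError("empty trace")
    if window_us <= 0:
        raise ValueError("window length must be positive")

    t_first = min(t for t, _ in packets)
    t_last = max(t for t, _ in packets)

    # Number of windows needed to cover the trace (integer arithmetic only).
    n_windows = (t_last - t_first) // window_us + 1
    bytes_per_window = [0] * n_windows
    for t, size in packets:
        bytes_per_window[(t - t_first) // window_us] += size

    peak = max(bytes_per_window)
    mean = sum(bytes_per_window) / n_windows
    return peak / mean


# Synthetic trace: nine 1000-byte packets within the first millisecond,
# then a single 1000-byte packet almost 100 ms later.
trace = [(i * 100, 1000) for i in range(9)] + [(99_999, 1000)]

ratio_5ms = peak_to_mean_ratio(trace, 5_000)      # 20 windows of 5 ms
ratio_100ms = peak_to_mean_ratio(trace, 100_000)  # a single 100 ms window
print(ratio_5ms, ratio_100ms)
```

For this synthetic trace the ratio is 18 at a 5 ms window and 1 at a 100 ms window, which illustrates why burstiness has to be stated together with the time scale at which it is measured.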
While training of neural networks makes up a large fraction of the workload in data centers [11], to the best of our knowledge, there does not exist a detailed analysis of distributed ML traffic and its potential impact on the creation of microbursts. The vast majority of network traffic from training DNN models is due to the exchange of gradients of model parameters. As modern DNN models involve millions and, in the case of large language models such as GPT, billions of parameters [12], the transmission of gradients creates huge data bursts.

The measurement experiments are performed in a testbed network with a single switch and 100 Gbps line rates. We evaluate a server-based and a serverless mode of training. In server-based training, the nodes involved in the training, referred to as workers, exchange gradients with a dedicated server. Here, the transmissions to the server create a bottleneck.
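As a rough, back-of-the-envelope illustration of the size of such gradient bursts, the Python sketch below estimates the volume of one full gradient exchange and the time needed to transmit it at line rate. The numbers are assumptions for the example rather than measurements from this paper: ResNet-50 is taken to have roughly 25.6 million parameters, each gradient is assumed to be a 32-bit float, and protocol headers, compression, and any splitting of the gradients into smaller messages are ignored. The helper `gradient_burst` is hypothetical.

```python
def gradient_burst(num_params, bytes_per_param=4, line_rate_bps=100e9):
    """Estimate the size and line-rate transmission time of one gradient set.

    num_params      : number of model parameters (one gradient value each)
    bytes_per_param : assumed size of one gradient value (4 = 32-bit float)
    line_rate_bps   : link rate in bits per second (100e9 = 100 Gbps)

    Returns (size_bytes, tx_time_s). Header overhead is ignored.
    """
    size_bytes = num_params * bytes_per_param
    tx_time_s = size_bytes * 8 / line_rate_bps
    return size_bytes, tx_time_s


# Assumed parameter count for ResNet-50 (approximately 25.6 million).
size_bytes, tx_time_s = gradient_burst(25_600_000)
print(f"{size_bytes / 1e6:.1f} MB, {tx_time_s * 1e3:.2f} ms at 100 Gbps")
```

Under these assumptions, a single worker's gradients amount to about 102 MB and occupy a 100 Gbps link for about 8 ms when sent back to back, which is several times longer than the sub-millisecond time scale of microbursts.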