SAPipe: Staleness-Aware Pipeline for Data Parallel DNN Training

Neural Information Processing Systems 

Data parallelism across multiple machines is widely adopted for accelerating distributed deep learning, but linear speedup is hard to achieve due to heavy communication overhead. In this paper, we propose SAPipe, a performant system that pushes the training speed of data parallelism to its fullest extent.
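To make the core idea concrete, the following toy sketch simulates staleness-bounded pipelined SGD in one dimension: the gradient computed at step t is only applied s steps later, modeling communication that is overlapped with subsequent computation. All names and the setup here are illustrative assumptions, not SAPipe's actual implementation.

```python
from collections import deque

def sgd_with_staleness(grad_fn, w0, lr=0.1, staleness=1, steps=200):
    """Toy 1-D simulation of staleness-s pipelined SGD.

    The gradient computed at step t is applied at step t + s,
    mimicking gradient communication overlapped with compute.
    (Illustrative sketch only, not SAPipe's actual algorithm.)
    """
    w = w0
    in_flight = deque()  # gradients "being communicated"
    for _ in range(steps):
        in_flight.append(grad_fn(w))      # launch comm for current grad
        if len(in_flight) > staleness:    # grad from s steps ago arrives
            w -= lr * in_flight.popleft() # apply the stale gradient
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = sgd_with_staleness(lambda w: 2 * (w - 3.0), w0=0.0, staleness=2)
```

With a small enough learning rate, the iterate still converges to the minimizer despite the delay, which is the intuition behind tolerating bounded staleness to hide communication latency.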