Theoretical Limits of Pipeline Parallel Optimization and Application to Distributed Deep Learning

Igor Colin, Ludovic Dos Santos, Kevin Scaman

Neural Information Processing Systems 

We investigate the theoretical limits of pipeline-parallel learning of deep learning architectures, a distributed setup in which computation is split per layer instead of per example. For smooth convex and non-convex objective functions, we provide matching lower and upper complexity bounds and show that a naive pipeline parallelization of Nesterov's accelerated gradient descent is optimal.
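To make the per-layer setup concrete, the following is a minimal sketch, not taken from the paper, of what a "naive" pipeline-style Nesterov accelerated gradient step could look like on a two-layer linear model: each layer is owned by one (here, simulated) worker, activations flow forward through the pipeline, gradient messages flow backward, and every worker applies a local Nesterov update to its own parameters. All variable names and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of a naive pipeline-parallel Nesterov AGD loop (NumPy,
# workers simulated sequentially); not the paper's algorithm or code.
import numpy as np

rng = np.random.default_rng(0)

# Two-layer linear model; layer k is owned by worker k.
layers = [rng.standard_normal((4, 8)) * 0.1,   # worker 0
          rng.standard_normal((8, 1)) * 0.1]   # worker 1
velocity = [np.zeros_like(W) for W in layers]
lr, momentum = 0.1, 0.9

X = rng.standard_normal((32, 4))
y = rng.standard_normal((32, 1))

for step in range(200):
    # Nesterov lookahead point: each worker shifts its weights by momentum * velocity.
    lookahead = [W + momentum * v for W, v in zip(layers, velocity)]

    # Forward pass through the pipeline: worker k consumes the activations of
    # worker k-1 and sends its own activations downstream.
    acts = [X]
    for W in lookahead:
        acts.append(acts[-1] @ W)
    loss = 0.5 * np.mean((acts[-1] - y) ** 2)

    # Backward pass: the gradient message travels back up the pipeline.
    grad_out = (acts[-1] - y) / len(X)           # d(loss)/d(output)
    grads = [None] * len(layers)
    for k in reversed(range(len(layers))):
        grads[k] = acts[k].T @ grad_out          # gradient for worker k's weights
        grad_out = grad_out @ lookahead[k].T     # message sent to worker k-1

    # Each worker applies Nesterov's update locally to its own parameters.
    for k in range(len(layers)):
        velocity[k] = momentum * velocity[k] - lr * grads[k]
        layers[k] += velocity[k]

print("final loss:", loss)
```

In an actual pipeline-parallel deployment the forward and backward loops would be replaced by point-to-point communication between devices; the sketch only illustrates that each worker touches a single layer and exchanges activation and gradient messages with its neighbors.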