Reviews: GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Neural Information Processing Systems 

Originality
* Their proposed algorithm offers little in the way of surprising conceptual insight, but here that is a strength: the parallelism scheme is simple, intuitive, and achieves nearly linear throughput scaling in the number of accelerators (it is hard to expect much more). It is surprising this has not been done before, given that such a simple trick yields such a large speedup for training distributed models.
* To train their very deep models, the authors also use a few smaller tricks (e.g., clipping logits to mitigate bad gradients).

Significance & Quality
* General-purpose model parallelism algorithm: the proposed algorithm is applicable to almost any neural network architecture without modification; the authors demonstrate this by scaling up state-of-the-art architectures in both computer vision and NLP settings. In machine translation for low-resource languages, the gains are substantial.

Clarity
* Clearly written, aided by the simplicity of the algorithm.
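(As a side note on the near-linear scaling claim: in a GPipe-style schedule, a mini-batch is split into M micro-batches that flow through K pipeline stages, so each stage is busy for M of the M + K - 1 pipeline steps. A minimal sketch of that utilization calculation, with the function name being my own illustrative choice:)

```python
def pipeline_utilization(num_stages: int, num_micro_batches: int) -> float:
    """Fraction of time accelerators do useful work in a GPipe-style
    micro-batch pipeline: M micro-batches through K stages means each
    stage is active for M out of M + K - 1 total pipeline steps."""
    K, M = num_stages, num_micro_batches
    return M / (M + K - 1)

# More micro-batches shrink the pipeline "bubble", approaching linear scaling.
for m in (1, 4, 32):
    print(f"K=4 stages, M={m:>2} micro-batches: "
          f"{pipeline_utilization(4, m):.0%} utilization")
```

With K = 4 and M = 32 this gives about 91% utilization, which is why throughput scales nearly linearly once the micro-batch count is large relative to the stage count.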