Reviews: Ouroboros: On Accelerating Training of Transformer-Based Language Models

Neural Information Processing Systems 

This paper studies the problem of parallelising large transformer-based language models. It goes beyond data parallelism in that it focuses on splitting the model when it does not fit in the memory of a single GPU. The idea is to segment the model into groups so that GPUs do not sit idle waiting for others to pass gradients (as happens in layer-wise parallel solutions where each layer is on its own GPU). The method then allows backpropagation to use stale gradients between groups. An L-layer network is split into K modules: the weights of the network are divided into K groups, and each group is placed on its own GPU.
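To make the group-splitting and stale-gradient idea concrete, here is a minimal toy sketch (not the authors' code; all names, the scalar-chain model, and the one-step delay schedule are illustrative assumptions). It splits an L-layer chain into K groups and lets each group update with the gradient from the previous iteration instead of waiting for the current backward pass:

```python
# Toy sketch: an L-layer chain of scalar "layers" split into K groups,
# each group updated with a one-step-stale gradient. This only illustrates
# the scheme described in the review, not the paper's implementation.

L, K = 8, 4                      # L layers split into K groups
group_size = L // K
weights = [1.0] * L              # one scalar weight per layer
lr = 0.01

def forward(x, weights):
    # Chain of scalar linear layers: y = w_L * ... * w_1 * x
    for w in weights:
        x = w * x
    return x

def grads(x, weights):
    # Gradient of the output w.r.t. each weight in the scalar chain:
    # dy/dw_i = y / w_i (valid since every w_i is nonzero here).
    y = forward(x, weights)
    return [y / w for w in weights]

# Each group keeps a stale-gradient buffer: instead of waiting for the
# current backward pass to reach it, a group applies the gradient that
# was computed in the previous iteration.
stale = [None] * K
for step in range(5):
    g = grads(2.0, weights)                  # gradients at current weights
    for k in range(K):
        lo, hi = k * group_size, (k + 1) * group_size
        if stale[k] is not None:             # apply last iteration's gradient
            for i, gi in zip(range(lo, hi), stale[k]):
                weights[i] -= lr * gi
        stale[k] = g[lo:hi]                  # buffer for the next iteration
```

In a real multi-GPU setting each group would live on its own device and the buffers would hide the inter-group gradient communication latency; here a single loop simulates the delayed updates.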