Reviews: Ouroboros: On Accelerating Training of Transformer-Based Language Models
Neural Information Processing Systems
This paper studies the problem of parallelising large transformer-based language models. It goes beyond data parallelism by focusing on splitting the model when it does not fit in the memory of a single GPU. The idea is to segment the model into groups so that GPUs do not sit idle waiting for others to pass gradients (as happens in layer-wise parallel solutions where each layer is placed on its own GPU). Backpropagation is then allowed to use stale gradients between groups. Concretely, an L-layer network is split into K modules so that the weights of the network are divided into K groups, and each group is placed on its own GPU; see the sketch below.
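As a rough illustration of the setup the review describes, here is a minimal PyTorch sketch of splitting an L-layer network into K groups, one per GPU. The layer sizes, group boundaries, and device names are assumptions chosen for illustration; this is not the authors' implementation, and the stale-gradient schedule itself is only noted in comments.

```python
# A minimal sketch (assumed layer sizes and K available GPUs), not Ouroboros itself.
import torch
import torch.nn as nn

L, K = 12, 4                      # an L-layer network split into K groups
layers = [nn.Linear(512, 512) for _ in range(L)]
group_size = L // K               # assumes K divides L evenly

groups = []
for k in range(K):
    block = nn.Sequential(*layers[k * group_size:(k + 1) * group_size])
    groups.append(block.to(f"cuda:{k}"))   # one weight group per GPU

def forward(x):
    # Activations hop from GPU k to GPU k+1. In the scheme the paper
    # proposes, group k's backward pass may use gradients that are a few
    # steps stale, so the GPUs need not block waiting on one another.
    for k, block in enumerate(groups):
        x = block(x.to(f"cuda:{k}"))
    return x
```

The point of the grouping is that each GPU holds only L/K layers' worth of weights, so a model too large for one device still fits, while stale gradients between groups keep every device busy.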