Reviews: Ouroboros: On Accelerating Training of Transformer-Based Language Models
– Neural Information Processing Systems
The paper introduces a new method for model-parallel training, in which the layers of a model are distributed across multiple accelerators. The method avoids locking in the backward pass by using stale gradients during back-propagation; I am not aware of prior work that has taken this approach. Furthermore, the authors provide theoretical analysis and empirical results showing that, despite the stale gradients, the method has convergence properties comparable to conventional SGD. The lack of effective model-parallel training is a major roadblock to scaling up model sizes, and the proposed approach promises to overcome this issue.
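To make the core idea concrete: updating parameters with a gradient computed several steps earlier can be sketched on a toy objective. This is an illustrative example, not the authors' implementation; the function, learning rate, and delay are all chosen for demonstration only.

```python
import numpy as np

def stale_sgd(w0, lr=0.1, delay=2, steps=100):
    """SGD where each update uses the gradient evaluated at the iterate
    from `delay` steps ago, mimicking the staleness introduced when the
    backward pass is not locked to the most recent forward pass.

    Toy objective: f(w) = 0.5 * ||w||^2, so grad f(w) = w.
    """
    w = np.asarray(w0, dtype=float)
    history = [w.copy()]  # past iterates; the stale gradient is read from here
    for t in range(steps):
        stale_w = history[max(0, t - delay)]  # iterate from `delay` steps back
        grad = stale_w                        # gradient of the toy objective
        w = w - lr * grad
        history.append(w.copy())
    return w

w_final = stale_sgd([1.0, -2.0])
```

Even with the delayed gradient, the iterates still converge toward the minimizer on this toy problem, which is the qualitative behavior the paper's convergence analysis formalizes for the non-convex training setting.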
Jan-22-2025, 01:42:15 GMT