Review for NeurIPS paper: Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
Neural Information Processing Systems
Summary and Contributions: This paper proposes to accelerate the training of Transformer networks by progressively dropping Transformer layers during training. First, it compares two BERT architectures, PostLN and PreLN. PostLN applies layer normalization after the element-wise addition in each Transformer block, whereas PreLN moves layer normalization onto the input stream of the sublayers. The paper finds that PostLN is more sensitive to the choice of hyperparameters and often diverges under more aggressive learning rates, whereas PreLN avoids vanishing gradients and leads to more stable optimization.
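To make the PostLN/PreLN distinction concrete, here is a minimal NumPy sketch of the two block orderings described above. The function names and the simplified `sublayer` abstraction are illustrative, not taken from the paper's code; learnable scale/shift parameters of layer normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # PostLN: layer norm is applied AFTER the residual (element-wise) addition
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # PreLN: layer norm is applied only on the input stream of the sublayer;
    # the residual path stays an identity, which eases gradient flow
    return x + sublayer(layer_norm(x))
```

Note that in the PreLN ordering the residual path is a pure identity: if a sublayer's output is zero, the block passes its input through unchanged, which is one intuition for why PreLN tolerates more aggressive learning rates.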