Review for NeurIPS paper: Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
Summary and Contributions: This paper proposes to accelerate the training of Transformer networks by progressively dropping Transformer layers during training. First, it compares two BERT architectures, PostLN and PreLN. PostLN applies layer normalization after the element-wise addition in each Transformer block, whereas PreLN moves layer normalization to the input stream of the sublayers. The paper finds that PostLN is more sensitive to the choice of hyperparameters and that training often diverges with more aggressive learning rates, whereas PreLN avoids vanishing gradients and leads to more stable optimization.
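The PostLN/PreLN distinction described above can be sketched as follows. This is an illustrative sketch, not the paper's code: `sublayer` stands in for an attention or feed-forward module and `layer_norm` for a LayerNorm module.

```python
def post_ln_block(x, sublayer, layer_norm):
    # PostLN: layer normalization is applied after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer, layer_norm):
    # PreLN: layer normalization is applied only on the input stream of
    # the sublayer; the residual path itself stays unnormalized.
    return x + sublayer(layer_norm(x))
```

Because the PreLN residual path bypasses normalization, gradients flow through an identity connection across the whole stack, which is the property the review credits with more stable optimization.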
The proposed method for training BERT is practically useful. My main concern with this paper is that its novelty is somewhat limited: it combines two existing techniques. One is PreLN, which has been well studied in the literature on training BERT; the other is stochastically dropping layers, which was first proposed for training CV models. On the other hand, effectively combining these two techniques and tuning them to work for BERT training requires non-trivial effort.
Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
Recently, Transformer-based language models have demonstrated remarkable performance across many NLP domains. However, the unsupervised pre-training step of these models suffers from prohibitive computational expense. Current methods for accelerating pre-training either rely on massive parallelism with advanced hardware or are not applicable to language models. In this work, we propose a method based on progressive layer dropping that speeds up the training of Transformer-based language models, not at the cost of excessive hardware resources but through efficiency gained from changes to the model architecture and training technique. Extensive experiments on BERT show that the proposed method achieves a 25% reduction in computational cost (FLOPS) and a 24% reduction in end-to-end wall-clock training time. Furthermore, we show that our pre-trained models exhibit strong knowledge transferability, achieving accuracy on downstream tasks similar to or even higher than that of baseline models.
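The progressive layer dropping idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a global keep probability that decays from 1.0 toward a floor `theta_bar` over training, scaled per layer so deeper layers are dropped more often (in the spirit of stochastic depth); the schedule shape and constants here are illustrative assumptions.

```python
import math
import random

def keep_probability(step, layer, num_layers, theta_bar=0.5, gamma=0.001):
    """Illustrative schedule: a global keep rate that decays from 1.0
    toward theta_bar as training progresses, with deeper layers
    (larger `layer` index) assigned lower keep probabilities."""
    theta_t = (1.0 - theta_bar) * math.exp(-gamma * step) + theta_bar
    return 1.0 - (layer / num_layers) * (1.0 - theta_t)

def forward_with_layer_drop(x, layers, step, rng=random):
    """During training, skip each layer's transform with probability
    1 - p_l; the residual (identity) path is always kept, so the
    network stays well-defined when a layer is dropped."""
    num_layers = len(layers)
    for i, layer in enumerate(layers, start=1):
        p = keep_probability(step, i, num_layers)
        if rng.random() < p:
            x = x + layer(x)  # residual form: identity plus sublayer
    return x
```

At step 0 every layer is kept with probability 1, so early training sees the full network; the savings in FLOPS accrue as the keep probabilities decay later in training.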