



Q1: Both reviewer #4 and reviewer #5 think it is essential to compare the proposed method with Pre-LayerNorm

Neural Information Processing Systems

Q1: Both reviewer #4 and reviewer #5 think it is essential to compare the proposed method with Pre-LayerNorm. We added additional experiments to investigate how PLD compares with PreLN. GLUE score (80.2) compared with Post-LN (82.1) on downstream tasks. When trained with the large learning rate as PLD, PreLN's ...

Q2: Reviewers #3, #4, and #5 ask about a comparison to simpler and alternative schedules. The current schedule is actually simple.


A training

Neural Information Processing Systems

Table 4 describes the hyperparameters for pre-training the baseline and PLD. Eqn. 5 indicates that the gradient ... Figure 1 shows the full comparison of the baseline and PLD, fine-tuned at different checkpoints. Specifically, the fine-tuning results are often much worse with a large learning rate. Figure 11: The fine-tuning results at different checkpoints. Figure 12: Convergence curves varying the keep ratio θ.





Review for NeurIPS paper: Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

Neural Information Processing Systems

Summary and Contributions: This paper proposes to accelerate the training of Transformer networks by progressively dropping Transformer layers from the network during training. First, it compares two BERT architectures, PostLN and PreLN. PostLN applies layer normalization after the element-wise addition in each Transformer block. PreLN moves layer normalization so that it is applied only to the input stream of the sublayers. The paper finds that PostLN is more sensitive to the choice of hyperparameters and that its training often diverges with more aggressive learning rates, whereas PreLN avoids vanishing gradients and leads to more stable optimization.
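
The difference between the two placements described in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`layer_norm`, `post_ln_block`, `pre_ln_block`) are hypothetical, the learnable gain and bias of layer normalization are omitted, and `sublayer` stands in for an attention or feed-forward sublayer.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance.

    Assumption: the learnable gain and bias are left out for brevity.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """PostLN: layer normalization is applied after the residual addition."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def pre_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """PreLN: only the sublayer input is normalized; the residual path is untouched."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


if __name__ == "__main__":
    x = [1.0, 2.0, 4.0]
    # A sublayer that outputs zeros, used here only to isolate the residual path.
    zero_sublayer = lambda v: [0.0 for _ in v]
    print(pre_ln_block(x, zero_sublayer))   # input passes through unchanged
    print(post_ln_block(x, zero_sublayer))  # input is renormalized
```

With a sublayer that contributes nothing, the PreLN block returns its input unchanged, while the PostLN block still renormalizes it. That identity path through the residual connection is what the PreLN arrangement preserves.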