Neural Information Processing Systems 

Q1: Both reviewer #4 and reviewer #5 think it is essential to compare the proposed method with Pre-LayerNorm (PreLN). We added experiments to investigate how PLD compares with PreLN. [...] GLUE score (80.2) compared with Post-LN (82.1) on downstream tasks. When trained with the same large learning rate as PLD, PreLN's [...]

Q2: Reviewers #3, #4, and #5 ask about a comparison to simpler and alternative schedules. The current schedule is actually simple.