Q1: Both Reviewer #4 and Reviewer #5 think it is essential to compare the proposed method with Pre-LayerNorm
Neural Information Processing Systems
Q1: Both Reviewer #4 and Reviewer #5 think it is essential to compare the proposed method with Pre-LayerNorm (PreLN). We added experiments to investigate how PLD compares with PreLN. [...] GLUE score (80.2) compared with Post-LN (82.1) on downstream tasks. When trained with the same large learning rate as PLD, PreLN's [...]

Q2: Reviewers #3, #4, and #5 ask about a comparison to simpler and alternative schedules. The current schedule is actually simple.
Aug-15-2025, 12:38:21 GMT