a1140a3d0df1c81e24ae954d935e8926-Supplemental.pdf

Neural Information Processing Systems 

$\sum_{i=l}^{L-1} f_{RT}(x_i)$) would be, and another term of $\frac{\partial \mathcal{E}}{\partial x_L}\left(\frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} f_{RT}(x_i)\right)$ that propagates through the Transformer blocks. Figure 1 shows the full comparison of the baseline and PLD, fine-tuned at different checkpoints. Overall, we observe that PLD not only trains BERT faster in pre-training but also preserves the performance on downstream tasks. Results are visualized in Figure 1, which shows that the baseline is less robust to the choice of learning rates.
