a1140a3d0df1c81e24ae954d935e8926-Supplemental.pdf
–Neural Information Processing Systems
$\sum_{i=l}^{L-1} f_{RT}(x_i)$) would be, and another term of $\frac{\partial E}{\partial x_L}\left(\frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} f_{RT}(x_i)\right)$ that propagates through the Transformer blocks. Figure 1 shows the full comparison of the baseline and PLD, fine-tuned at different checkpoints. Overall, we observe that PLD not only trains BERT faster in pre-training but also preserves the performance on downstream tasks. Results are visualized in Figure 1, which shows that the baseline is less robust to the choice of learning rates.
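The gradient term that propagates through the Transformer blocks can be placed in context with a short derivation. This is a sketch only. It assumes an identity-mapping residual recursion $x_{i+1} = x_i + f_{RT}(x_i)$ for blocks $l, \dots, L-1$ and a scalar loss $E$; the excerpt above uses these symbols but does not state the recursion, so treat it as an assumption rather than the authors' exact formulation.

```latex
% Assumption (not stated in the excerpt): x_{i+1} = x_i + f_{RT}(x_i)
\begin{align}
x_L &= x_l + \sum_{i=l}^{L-1} f_{RT}(x_i) \\
\frac{\partial E}{\partial x_l}
  &= \frac{\partial E}{\partial x_L}\,\frac{\partial x_L}{\partial x_l}
   = \frac{\partial E}{\partial x_L}
     \left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} f_{RT}(x_i)\right) \\
  &= \underbrace{\frac{\partial E}{\partial x_L}}_{\text{direct path}}
   + \underbrace{\frac{\partial E}{\partial x_L}
     \left(\frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} f_{RT}(x_i)\right)}_{\text{through the Transformer blocks}}
\end{align}
```

Under this assumption, the first term reaches $x_l$ without passing through any block, while the second is the term the text describes as propagating through the Transformer blocks.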
Feb-9-2026, 15:05:13 GMT
- Technology: