aac02401755a65904cf977a33136af4a-Supplemental-Conference.pdf
–Neural Information Processing Systems
As described in the main paper, we believe that this is a limitation of the common gradient clipping technique: although gradient clipping can prevent an excessively large gradient at any single step, it cannot prevent the gradient variance from accumulating in certain dimensions (as shown in Figure 10(d)), especially for large batch sizes. Overall, this analysis demonstrates that the proposed approach requires less or no tuning of gradient clipping, while the baseline still has training stability issues even with more gradient clipping.

Figure 11: Validation perplexity and Adam variance norm/max element during GPT-2 117M pretraining, comparing the baseline and the proposed work (SLW) under different batch sizes/LR and sequence lengths.
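The limitation described above can be illustrated with a small toy sketch. It is not the experiment from the paper: the gradient values, the clipping threshold, and the helper names `clip_by_global_norm` and `adam_variance_stats` are all hypothetical, and we assume that "variance norm" and "max element" refer to the L2 norm and the largest entry of Adam's second-moment estimate. The sketch applies global-norm clipping at every step and still observes the second-moment estimate concentrating in one dimension.

```python
import math


def clip_by_global_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def adam_variance_stats(grads, max_norm, beta2=0.999):
    """Run Adam's second-moment update on clipped gradients.

    Returns (L2 norm of v, max element of v), where v is the
    exponential moving average of squared gradients (no bias correction).
    """
    v = [0.0] * len(grads[0])
    for grad in grads:
        clipped = clip_by_global_norm(grad, max_norm)
        v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, clipped)]
    v_norm = math.sqrt(sum(x * x for x in v))
    return v_norm, max(v)


# Hypothetical toy gradients: dimension 0 dominates at every step.
grads = [[5.0, 0.1, 0.1, 0.1] for _ in range(200)]

v_norm, v_max = adam_variance_stats(grads, max_norm=1.0)
print(f"variance norm = {v_norm:.6f}, max element = {v_max:.6f}")
```

In this toy setting every clipped gradient has norm at most 1.0, yet the max element of the second-moment estimate keeps growing with the number of steps and accounts for nearly all of the norm. Clipping bounds the per-step magnitude but does not change which dimensions the accumulated variance concentrates in.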
Feb-11-2026, 06:57:36 GMT