Scaling Law with Learning Rate Annealing

Neural Information Processing Systems 

We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps: L(s) = L0 +A S α1 C S2, where L(s)is the validation loss at step s, S1 is the area under the LR curve, S2 is the LR annealing area, and L0, A, C, αare constant parameters.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found