Neural Information Processing Systems
Figure 7: Training loss, Adam variance norm/max element, and correlations between loss spikes and variance norm/max during GPT-2 pre-training (without the proposed method) under different model sizes, batch sizes (and LRs), and sequence lengths.

A.1 Zoom-in of Figure 1

Figure 7 zooms in on the first 30B tokens of Figure 1 in the main paper, where training is the most unstable.

A.2 Learning rate decay for the proposed approach

As discussed in the GPT-2 experiments of main paper Section 5.1, the proposed approach needs more training steps than the baseline in order to reach the same 157B training tokens. This makes it necessary to modify the learning rate decay schedule for the proposed approach. We first tried increasing the number of learning rate decay steps by half of the proposed approach's pacing function duration T (since the proposed approach roughly needs T/2 additional steps to reach 157B tokens).
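The schedule adjustment above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of a warmup-plus-cosine schedule, and the specific step counts are all assumptions introduced for clarity; only the "extend the decay horizon by T/2" rule comes from the text.

```python
import math

def decay_steps_with_pacing(baseline_decay_steps: int, pacing_duration_T: int) -> int:
    """Extend the LR decay horizon by T/2 extra steps, since the pacing
    function of duration T roughly adds T/2 steps before the run reaches
    the same token budget as the baseline. (Illustrative helper, not from
    the paper's code.)"""
    return baseline_decay_steps + pacing_duration_T // 2

def cosine_lr(step: int, peak_lr: float, warmup_steps: int,
              decay_steps: int, min_lr: float = 0.0) -> float:
    """Generic linear-warmup + cosine-decay schedule (an assumption; the
    paper does not specify the schedule shape here). Only `decay_steps`
    changes between the baseline and the adjusted schedule."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, decay_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: a baseline that decays over 10,000 steps, with a pacing
# function of duration T = 2,000 steps, would decay over 11,000 steps.
adjusted = decay_steps_with_pacing(10_000, 2_000)   # 11,000
```

With this adjustment, the learning rate reaches its minimum only at the extended horizon, so the tail of training under the proposed approach is not run at an already fully decayed learning rate.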