aac02401755a65904cf977a33136af4a-Supplemental-Conference.pdf

Mar-27-2025, 11:28:35 GMT–Neural Information Processing Systems

Figure 7: Training loss, Adam variance norm/max element, and correlations between loss spikes and variance norm/max during GPT-2 pre-training (without the proposed method) under different model sizes, batch sizes (and learning rates), and sequence lengths.

A.1 Zoom in of Figure 1

Figure 7 zooms in on the first 30B tokens of Figure 1 in the main paper, where training is the most unstable.

A.2 Learning rate decay for proposed approach

As discussed in the GPT-2 experiments of main paper Section 5.1, the proposed approach needs more training steps than the baseline to reach the same 157B training tokens. This makes it necessary to modify the learning rate decay schedule for the proposed approach. We first tried increasing the number of learning rate decay steps by half of the proposed approach's pacing function duration T, since the proposed approach needs roughly T/2 additional steps to reach 157B tokens.
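The T/2 estimate and the resulting schedule change can be illustrated with a minimal sketch. It assumes, purely for illustration, that the pacing function ramps the sequence length linearly to its full value over T steps and that the number of sequences per batch is fixed, so tokens per step scale with sequence length. The excerpt does not state either of these. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
def extra_steps_for_linear_pacing(duration_t, start_len, full_len):
    """Estimate the extra full-length steps needed to recover the tokens
    skipped while the sequence length is still ramping up.

    Assumption (not from the source): the ramp is linear in the step index
    and each step processes a fixed number of sequences.
    """
    missing = 0.0
    for step in range(duration_t):
        # Hypothetical linear pacing from start_len toward full_len.
        seq_len = start_len + (full_len - start_len) * step / duration_t
        # Shortfall of this step relative to a full-length step.
        missing += full_len - seq_len
    # Convert the accumulated shortfall into equivalent full-length steps.
    return missing / full_len


def extended_decay_steps(baseline_decay_steps, duration_t):
    """First attempt described in the text: lengthen the learning rate
    decay by half of the pacing function duration T."""
    return baseline_decay_steps + duration_t // 2


if __name__ == "__main__":
    t = 1000  # hypothetical pacing duration in steps
    print(extra_steps_for_linear_pacing(t, start_len=8, full_len=1024))
    print(extended_decay_steps(baseline_decay_steps=10_000, duration_t=t))
```

Under these assumptions the estimated number of extra steps comes out close to T/2 when the starting sequence length is small relative to the full length, which matches the rough figure quoted above.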

large language model, machine learning, natural language, (20 more...)

Genre:
- Research Report (0.47)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.56)
  - Natural Language
    - Chatbot (0.56)
    - Large Language Model (0.72)