Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
