Power Lines: Scaling laws for weight decay and batch size in LLM pre-training
