Variance Control via Weight Rescaling in LLM Pre-training
