GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling
