Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
