Learning Rate Transfer in Normalized Transformers
