What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers
