The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
