The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training

Wang, Jinbo, Wang, Mingze, Zhou, Zhanpeng, Yan, Junchi, E, Weinan, Wu, Lei

Feb-26-2025, arXiv.org Machine Learning

Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and interactions among these blocks is important. In this paper, we uncover a clear Sharpness Disparity across these blocks, which emerges early in training and intriguingly persists throughout the training process. Motivated by this finding, we propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block's sharpness, accelerating large language model (LLM) pre-training. By integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. We demonstrate this acceleration across GPT-2 and LLaMA, with model sizes ranging from 0.12B to 1.1B and datasets of OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory saving. These results underscore the potential of exploiting the sharpness disparity to improve LLM training.
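As a rough illustration of what a per-block learning rate could look like in practice, the sketch below groups parameters by block type and scales a base LR with one multiplier per block. It is not the authors' implementation: the abstract does not give the rule that maps sharpness to LR, so the multiplier values are placeholders, and the helper names (`classify_block`, `build_param_groups`), the substring rules, and the GPT-2-style parameter names are all hypothetical.

```python
from collections import OrderedDict

# Hypothetical substring rules mapping a parameter name to a block type.
# Rules are checked in order; names that match nothing fall into "other".
BLOCK_RULES = [
    ("norm", ("ln_", "norm")),
    ("embedding", ("wte", "wpe", "embed")),
    ("attention", ("attn",)),
    ("ffn", ("mlp", "ffn")),
]


def classify_block(param_name):
    """Return the block type for a parameter name, or "other"."""
    for block, keys in BLOCK_RULES:
        if any(key in param_name for key in keys):
            return block
    return "other"


def build_param_groups(named_params, base_lr, multipliers):
    """Group parameters by block type and give each group its own LR.

    named_params: iterable of (name, parameter) pairs.
    multipliers:  block type -> LR multiplier (missing blocks use 1.0).
    Returns an OrderedDict: block type -> {"params": [...], "lr": float}.
    """
    groups = OrderedDict()
    for name, param in named_params:
        block = classify_block(name)
        lr = base_lr * multipliers.get(block, 1.0)
        group = groups.setdefault(block, {"params": [], "lr": lr})
        group["params"].append(param)
    return groups


# Illustrative GPT-2-style names; strings stand in for real tensors.
named_params = [
    ("wte.weight", "p0"),
    ("h.0.ln_1.weight", "p1"),
    ("h.0.attn.c_attn.weight", "p2"),
    ("h.0.mlp.c_fc.weight", "p3"),
    ("ln_f.weight", "p4"),
]

# Placeholder multipliers, NOT values taken from the paper.
multipliers = {"embedding": 4.0, "norm": 1.0, "attention": 2.0, "ffn": 2.0}

groups = build_param_groups(named_params, 1e-3, multipliers)
for block, group in groups.items():
    print(block, group["lr"], group["params"])
```

With real tensors in place of the strings, `list(groups.values())` has the parameter-group shape (dicts with `"params"` and `"lr"` keys) that `torch.optim.AdamW` accepts as its first argument, so each block is stepped with its own learning rate.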

Country:
- Asia
  - Singapore (0.04)
  - Middle East > Jordan (0.04)
  - Indonesia > Bali (0.04)
  - China
    - Shanghai > Shanghai (0.04)
    - Beijing > Beijing (0.04)

Genre:
- Research Report > New Finding (0.66)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
