Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs), such as computation acceleration and model interpretability. Although promoting greater activation sparsity within LLMs deserves deep study, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study of the quantitative scaling properties and influential factors of activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-p% sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio (i.e., 1 minus the sparsity ratio) increases linearly with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies only weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws toward LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.

Activation sparsity refers to the phenomenon where a considerable fraction of the elements within the output of a neural layer (typically an activation function, as shown in Figure 1) are zero or low-valued and thus contribute weakly to the final model output for a given input. Generally, a model with a greater sparsity ratio (i.e., the ratio of inactivated elements) has more potential in scenarios such as computation acceleration and model interpretability.
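For concreteness, the PyTorch sketch below illustrates the basic quantities the text refers to: the activation ratio (fraction of non-zero intermediate activations) and its complement, the sparsity ratio, measured on a toy gated feed-forward block. This is not the paper's PPL-p% metric, which the abstract describes only as a precise, performance-aware measure applicable to any activation function; the module name `GatedFFN`, the dimensions, and the threshold `eps` are illustrative assumptions.

```python
# Illustrative sketch (not the paper's PPL-p% procedure): measuring the
# activation ratio of a gated FFN block by counting non-zero activations.
import torch
import torch.nn as nn


class GatedFFN(nn.Module):
    """A simplified LLaMA-style gated feed-forward block (hypothetical names)."""

    def __init__(self, hidden_dim: int, ffn_dim: int, act: nn.Module = nn.ReLU()):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.up = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, hidden_dim, bias=False)
        self.act = act

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Intermediate activations whose sparsity we want to measure.
        a = self.act(self.gate(x)) * self.up(x)
        self._last_activation = a.detach()
        return self.down(a)


def activation_ratio(a: torch.Tensor, eps: float = 0.0) -> float:
    """Fraction of activation entries with magnitude above `eps`.

    Sparsity ratio = 1 - activation ratio.
    """
    return (a.abs() > eps).float().mean().item()


if __name__ == "__main__":
    torch.manual_seed(0)
    ffn = GatedFFN(hidden_dim=64, ffn_dim=256)
    x = torch.randn(8, 16, 64)  # (batch, sequence, hidden)
    _ = ffn(x)
    ratio = activation_ratio(ffn._last_activation)
    print(f"activation ratio: {ratio:.3f}, sparsity ratio: {1 - ratio:.3f}")
```

With a ReLU gate and random inputs, roughly half of the gate outputs are exactly zero, so the product is zero there and the reported sparsity ratio lands near 0.5; for SiLU, which rarely produces exact zeros, a small positive `eps` would be needed to count near-zero entries as inactive.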
Nov-4-2024