AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models

Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W. Mahoney, Yaoqing Yang

arXiv.org Machine Learning

Recent work on pruning large language models (LLMs) (Frantar and Alistarh, 2023a; Jaiswal et al., 2023; Sun et al., 2023) has shown that the number of parameters can be reduced significantly without compromising performance, yielding notable savings in memory footprint, computing time, and energy consumption. Unlike pre-LLM pruning methods (Kurtic et al., 2022; Sanh et al., 2020), existing LLM pruning approaches typically allocate the "sparsity budget" (i.e., the number of pruned parameters or the pruning ratios) uniformly across layers, which makes it difficult to push sparsity to very high levels. Relatively little effort has been put into developing theoretically principled ways to compute layerwise pruning ratios. For example, the Outlier Weighed Layerwise sparsity (OWL) method (Yin et al., 2023) uses nonuniform layerwise sparsity based on the distribution of outlier activations. However, OWL relies on heuristics tied to the presence of outliers (Dettmers et al., 2022; Kovaleva et al., 2021; Puccetti et al., 2022), which can lead to suboptimal performance when outliers are absent and makes it difficult to reach very aggressive levels of sparsity. For example, Yin et al. (2023) show that pruning LLMs to 80% sparsity often significantly degrades their prediction performance.

First two authors contributed equally.
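
To make the notion of a layerwise sparsity budget concrete, the sketch below contrasts a uniform allocation with a nonuniform one driven by per-layer scores. This is a minimal illustration under assumed names and values, not the OWL or AlphaPruning allocation rule; the function `allocate_layerwise_sparsity`, the example scores, and the `max_shift` heuristic are hypothetical.

```python
import numpy as np

def allocate_layerwise_sparsity(layer_scores, target_sparsity, max_shift=0.1):
    """Toy allocation of per-layer pruning ratios (hypothetical).

    Layers with higher `layer_scores` (e.g., more outlier activations, or a
    heavier-tailed weight spectrum) are pruned less aggressively, while the
    mean ratio stays at `target_sparsity` (up to clipping into [0, 1]).
    """
    scores = np.asarray(layer_scores, dtype=float)
    # Center scores around their mean; dividing by the max magnitude keeps
    # the values in [-1, 1] and preserves the zero mean.
    centered = scores - scores.mean()
    if np.abs(centered).max() > 0:
        centered = centered / np.abs(centered).max()
    # Shift in the opposite direction: high score -> lower pruning ratio.
    ratios = target_sparsity - max_shift * centered
    return np.clip(ratios, 0.0, 1.0)

# Uniform baseline: every layer pruned at the global target (e.g., 70%).
uniform = np.full(4, 0.7)

# Nonuniform allocation driven by hypothetical per-layer scores.
scores = [3.0, 1.0, 0.5, 2.0]
print(allocate_layerwise_sparsity(scores, target_sparsity=0.7))
```

Any such rule still has to respect the global budget (the mean of the per-layer ratios); the difference between methods lies in which per-layer statistic drives the deviation from uniformity.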