An, Fengwei
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
Xu, Peng, Shao, Wenqi, Chen, Mengzhao, Tang, Shitao, Zhang, Kaipeng, Gao, Peng, An, Fengwei, Qiao, Yu, Luo, Ping
Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and question answering. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer-wise approach results in significant perturbation to the model's output and requires meticulous hyperparameter tuning, such as the pruning rate, which can adversely affect overall model performance. To address this, this paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA), which applies a blockwise reconstruction loss. In contrast to typical layer-wise pruning techniques, BESA is characterized by two distinctive attributes: i) it targets the overall pruning error with respect to individual transformer blocks, and ii) it allocates layer-specific sparsity in a differentiable manner, both of which ensure reduced performance degradation after pruning. Our experiments show that BESA achieves state-of-the-art performance, efficiently pruning LLMs like LLaMA1 and LLaMA2 with 7B to 70B parameters on a single A100 GPU in just five hours. Code is available at https://github.com/OpenGVLab/LLMPrune-BESA.
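To make the blockwise idea concrete, here is a minimal sketch (not the authors' code) of pruning one block against a blockwise reconstruction loss while learning layer-specific sparsity differentiably. The toy two-layer block, the soft magnitude-threshold mask, and the 50% target density are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for one transformer block: two linear layers with an activation.
toy_block = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
calib_x = torch.randn(128, 64)                  # calibration inputs
with torch.no_grad():
    dense_out = toy_block(calib_x)              # target: the dense block's output

linears = [m for m in toy_block if isinstance(m, nn.Linear)]
dense_w = [m.weight.detach().clone() for m in linears]
# One learnable threshold logit per layer -> layer-specific sparsity.
thresh_logits = nn.Parameter(torch.full((len(linears),), -2.0))
opt = torch.optim.Adam([thresh_logits], lr=5e-2)

def soft_mask(w, logit, temp=50.0):
    """Differentiable surrogate of a magnitude mask: weights whose normalized
    magnitude falls below a learned threshold are (softly) zeroed."""
    score = w.abs() / (w.abs().mean() + 1e-8)
    return torch.sigmoid(temp * (score - 2.0 * torch.sigmoid(logit)))

def pruned_forward(x):
    """Run the block with the current soft masks applied to each linear layer."""
    h, idx = x, 0
    for m in toy_block:
        if isinstance(m, nn.Linear):
            w = dense_w[idx] * soft_mask(dense_w[idx], thresh_logits[idx])
            h = F.linear(h, w, m.bias)
            idx += 1
        else:
            h = m(h)
    return h

for step in range(200):
    recon = F.mse_loss(pruned_forward(calib_x), dense_out)   # blockwise reconstruction loss
    masks = [soft_mask(w, t) for w, t in zip(dense_w, thresh_logits)]
    density = torch.stack([m.mean() for m in masks]).mean()
    loss = recon + (density - 0.5).abs()        # steer toward ~50% overall density
    opt.zero_grad(); loss.backward(); opt.step()

# Harden the learned masks and report the per-layer sparsity that was allocated.
with torch.no_grad():
    for i, (m, w, t) in enumerate(zip(linears, dense_w, thresh_logits)):
        hard = (soft_mask(w, t) > 0.5).float()
        m.weight.copy_(w * hard)
        print(f"layer {i}: sparsity = {1 - hard.mean().item():.2f}")
```

Because only the blockwise reconstruction error and the learned thresholds drive the optimization, the sparsity budget shifts automatically toward the layers that tolerate pruning best, which is the intuition behind the differentiable allocation described above.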
Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers
Dong, Pingcheng, Tan, Yonghao, Zhang, Dong, Ni, Tianwei, Liu, Xuejiao, Liu, Yu, Luo, Peng, Liang, Luhong, Liu, Shih-Yang, Huang, Xijie, Zhu, Huaiyu, Pan, Yun, An, Fengwei, Cheng, Kwang-Ting
Non-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs. Previous state-of-the-art works optimize these operations by piece-wise linear approximation and store the parameters in look-up tables (LUT), but most of them require unfriendly high-precision arithmetic such as FP/INT 32 and lack consideration of integer-only INT quantization. This paper proposes a genetic LUT-approximation algorithm, namely GQA-LUT, that can automatically determine the parameters with quantization awareness.
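The sketch below illustrates the general recipe the abstract describes, not the GQA-LUT implementation itself: a non-linear operation (GELU here) is approximated by piece-wise linear segments whose slopes and intercepts are quantized to low-bit fixed point, and a small genetic algorithm searches the segment breakpoints so the error is measured after quantization. The segment count, fixed-point format, and all genetic-algorithm hyperparameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-4.0, 4.0, 2048)
gelu = 0.5 * X * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (X + 0.044715 * X**3)))

N_SEG, FRAC_BITS = 8, 6            # 8 LUT segments, 6 fractional bits (assumed)

def quantize(v, frac_bits=FRAC_BITS):
    """Round to a fixed-point grid so the search is quantization-aware."""
    return np.round(v * (1 << frac_bits)) / (1 << frac_bits)

def lut_error(breakpoints):
    """Fit one linear piece per segment, quantize its slope/intercept,
    and return the max approximation error over the sample grid."""
    edges = np.concatenate(([X[0]], np.sort(breakpoints), [X[-1]]))
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (X >= lo) & (X <= hi)
        if m.sum() < 2:
            return np.inf                          # degenerate segment
        slope, icept = np.polyfit(X[m], gelu[m], 1)
        approx = quantize(slope) * X[m] + quantize(icept)
        err = max(err, np.max(np.abs(approx - gelu[m])))
    return err

# Tiny genetic search over the breakpoint positions.
pop = rng.uniform(X[0], X[-1], size=(32, N_SEG - 1))
for gen in range(40):
    fitness = np.array([lut_error(ind) for ind in pop])
    parents = pop[np.argsort(fitness)[:8]]         # selection: keep the best 8
    children = []
    for _ in range(len(pop) - len(parents)):
        a, b = parents[rng.integers(8)], parents[rng.integers(8)]
        child = np.where(rng.random(N_SEG - 1) < 0.5, a, b)   # crossover
        child += rng.normal(0, 0.1, N_SEG - 1)                # mutation
        children.append(np.clip(child, X[0], X[-1]))
    pop = np.vstack([parents, children])

best = pop[np.argmin([lut_error(ind) for ind in pop])]
print("max |error| of quantized piece-wise linear GELU:", lut_error(best))
```

At inference time only the quantized slope/intercept pairs and the breakpoints need to be stored in the LUT, which is what makes this style of approximation attractive for integer-only hardware.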