blockpruner
Accurate Sublayer Pruning for Large Language Models by Exploiting Latency and Tunability Information
Park, Seungcheol, Lee, Sojin, Kim, Jongjin, Lee, Jinsik, Jo, Hyunjik, Kang, U
How can we accelerate large language models (LLMs) without sacrificing accuracy? The slow inference speed of LLMs prevents us from benefiting from their remarkable performance in diverse applications. This is mainly because numerous sublayers are stacked together in LLMs. Sublayer pruning compresses and accelerates LLMs by removing unnecessary sublayers. However, existing sublayer pruning algorithms are limited in accuracy because they select sublayers to prune naively, overlooking the different characteristics of each sublayer. In this paper, we propose SPRINT (Sublayer PRuning wIth LateNcy and Tunability Information), an accurate sublayer pruning method for LLMs. SPRINT accurately selects a target sublayer to prune by considering 1) the amount of latency reduction after pruning and 2) the tunability of sublayers. SPRINT iteratively prunes redundant sublayers and swiftly tunes the parameters of the remaining sublayers. Experiments show that SPRINT achieves the best accuracy-speedup trade-off, exhibiting up to 23.88%p higher accuracy on zero-shot commonsense reasoning benchmarks compared to existing pruning algorithms.
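The abstract does not give SPRINT's scoring function, so the following is only a toy sketch of the general idea of weighing the damage of removing a sublayer against the latency it saves. The `Sublayer` class, the `damage_after_tuning` field, and the damage-per-millisecond ratio are all assumptions made for illustration, not the authors' method. In particular, the sketch uses fixed damage values, whereas the abstract describes re-tuning the remaining sublayers at each iteration.

```python
from dataclasses import dataclass


@dataclass
class Sublayer:
    name: str
    latency_ms: float           # latency saved if this sublayer is removed (hypothetical measurement, must be > 0)
    damage_after_tuning: float  # quality loss that remains after a quick tuning pass (hypothetical value)


def select_sublayer(candidates):
    """Pick the candidate with the lowest remaining damage per millisecond saved."""
    return min(candidates, key=lambda s: s.damage_after_tuning / s.latency_ms)


def prune_to_budget(sublayers, target_saving_ms):
    """Greedily remove sublayers until the latency-saving target is reached."""
    remaining = list(sublayers)
    pruned, saved = [], 0.0
    while remaining and saved < target_saving_ms:
        choice = select_sublayer(remaining)
        remaining.remove(choice)
        pruned.append(choice.name)
        saved += choice.latency_ms
        # A real pipeline would tune the remaining sublayers here and
        # re-measure damage; this sketch keeps the values fixed.
    return pruned, saved


sublayers = [
    Sublayer("mha_0", latency_ms=2.0, damage_after_tuning=0.4),
    Sublayer("mlp_0", latency_ms=1.0, damage_after_tuning=0.1),
    Sublayer("mha_1", latency_ms=2.0, damage_after_tuning=0.1),
]
pruned, saved = prune_to_budget(sublayers, target_saving_ms=3.0)
print(pruned, saved)
```

Dividing by the latency saving means that, of two sublayers with equal damage, the one that frees more inference time is removed first.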
MultiPruner: Balanced Structure Removal in Foundation Models
Muñoz, J. Pablo, Yuan, Jinjie, Jain, Nilesh
Recently, state-of-the-art approaches for pruning large pre-trained models (LPMs) have demonstrated that the training-free removal of non-critical residual blocks in Transformers is viable for reducing model size, achieving results that outperform previous training-free pruning approaches. Motivated by these findings, we extend BlockPruner (Zhong et al., 2024) and propose MultiPruner, a pruning approach that surpasses recent training-free pruning methods by adopting a multidimensional, iterative, fine-grained pruning strategy. In MultiPruner, multidimensional pruning reinstates the structural balance in block-pruned models by sequentially compressing along three dimensions: i) residual blocks, ii) channels of multilayer perceptrons (MLP), and iii) attention heads. This solution enhances zero-shot accuracy on downstream tasks compared to other techniques while improving model compression ratios, producing compressed models with fewer computing and memory requirements. Extensive experiments demonstrate the advantages of the proposed method across various large pre-trained models. The code and pruning configurations are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
BlockPruner: Fine-grained Pruning for Large Language Models
Zhong, Longguang, Wan, Fanqi, Chen, Ruijun, Quan, Xiaojun, Li, Liangzhi
With the rapid growth in the size and complexity of large language models (LLMs), the costs associated with their training and inference have escalated significantly. Research indicates that certain layers in LLMs harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance. While various layer pruning methods have been developed based on this insight, they generally overlook the finer-grained redundancies within the layers themselves. In this paper, we delve deeper into the architecture of LLMs and demonstrate that finer-grained pruning can be achieved by targeting redundancies in multi-head attention (MHA) and multi-layer perceptron (MLP) blocks. We propose a novel, training-free structured pruning approach called BlockPruner. Unlike existing layer pruning methods, BlockPruner segments each Transformer layer into MHA and MLP blocks.
Figure 1: Block Influence (BI) scores (Men et al., 2024) for the Llama2-7B model (Touvron et al., 2023b) computed at both layer and block levels, where blocks/layers with lower BI scores indicate less importance. The model has 32 Transformer layers, each containing one MHA and one MLP block, totaling 64 blocks. Block-level BI scores are generally lower than layer-level scores, indicating finer-grained redundancies.
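To make the Block Influence score concrete, here is a minimal NumPy sketch. It assumes BI is one minus the average cosine similarity between the hidden states entering and leaving a block, so a block that barely changes its input scores near 0. The function names `block_influence` and `rank_blocks` and the toy inputs are hypothetical and not taken from any of the papers' code.

```python
import numpy as np


def block_influence(hidden_in, hidden_out, eps=1e-8):
    """BI = 1 - mean cosine similarity between a block's input and output.

    hidden_in, hidden_out: arrays of shape (num_tokens, hidden_dim).
    """
    dot = np.sum(hidden_in * hidden_out, axis=-1)
    norms = np.linalg.norm(hidden_in, axis=-1) * np.linalg.norm(hidden_out, axis=-1)
    cos = dot / np.maximum(norms, eps)  # guard against zero-norm rows
    return float(1.0 - cos.mean())


def rank_blocks(states):
    """Return block indices sorted from lowest to highest BI.

    states[i] is the hidden state entering block i and states[i + 1] the
    state leaving it, so len(states) == num_blocks + 1.
    """
    scores = [block_influence(states[i], states[i + 1]) for i in range(len(states) - 1)]
    return sorted(range(len(scores)), key=lambda i: scores[i]), scores


# Toy example: block 0 only rescales its input, block 1 rotates it by 90 degrees.
x0 = np.array([[1.0, 0.0], [0.0, 2.0]])
x1 = 3.0 * x0
x2 = x1[:, ::-1]
order, scores = rank_blocks([x0, x1, x2])
print(order, scores)
```

In the toy example the rescaling block receives a BI close to 0 and is ranked first, marking it as the first candidate for removal under this metric.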