BlockPruner: Fine-grained Pruning for Large Language Models
Zhong, Longguang, Wan, Fanqi, Chen, Ruijun, Quan, Xiaojun, Li, Liangzhi
arXiv.org Artificial Intelligence
With the rapid growth in the size and complexity of large language models (LLMs), the costs associated with their training and inference have escalated significantly. Research indicates that certain layers in LLMs harbor substantial redundancy, and that pruning these layers has minimal impact on overall performance. While various layer pruning methods have been developed based on this insight, they generally overlook the finer-grained redundancies within the layers themselves. In this paper, we delve deeper into the architecture of LLMs and demonstrate that finer-grained pruning can be achieved by targeting redundancies in multi-head attention (MHA) and multi-layer perceptron (MLP) blocks. We propose a novel, training-free structured pruning approach called BlockPruner. Unlike existing layer pruning methods, BlockPruner segments each Transformer layer into MHA and MLP blocks.

Figure 1: Block Influence (BI) scores (Men et al., 2024) for the Llama2-7B model (Touvron et al., 2023b) computed at both layer and block levels, where blocks/layers with lower BI scores indicate less importance. The model has 32 Transformer layers, each containing one MHA and one MLP block, totaling 64 blocks. Block-level BI scores are generally lower than layer-level scores, indicating finer-grained redundancies.
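
The Figure 1 caption uses BI scores without defining them. As a rough illustration, the sketch below assumes the definition from Men et al. (2024): one minus the average cosine similarity between the hidden states entering and leaving a unit (a block or a whole layer). The function names, the toy residual updates, and the random data are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def block_influence(states_in, states_out):
    """Assumed BI definition: 1 - mean cosine similarity over token positions.

    states_in / states_out: one hidden-state vector per token, taken before
    and after the unit being scored. A low score means the unit barely
    changes the direction of the hidden states.
    """
    sims = [cosine(u, v) for u, v in zip(states_in, states_out)]
    return 1.0 - sum(sims) / len(sims)


def add_noise(states, scale, rng):
    """Toy stand-in for a residual update: x + small random perturbation."""
    return [[x + scale * rng.gauss(0.0, 1.0) for x in vec] for vec in states]


rng = random.Random(0)
# Hypothetical hidden states: 8 tokens, 16 dimensions.
x0 = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
x1 = add_noise(x0, 0.1, rng)  # after a toy "MHA" block
x2 = add_noise(x1, 0.1, rng)  # after a toy "MLP" block

bi_mha = block_influence(x0, x1)    # block-level score
bi_mlp = block_influence(x1, x2)    # block-level score
bi_layer = block_influence(x0, x2)  # layer-level score (MHA + MLP together)

# Block count from the caption: 32 layers, each with one MHA and one MLP block.
num_blocks = 32 * 2
```

Scoring each block separately yields two scores per layer, which is what allows the block-level and layer-level curves in Figure 1 to be compared.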
Jun-20-2024