BlockPruner: Fine-grained Pruning for Large Language Models

Zhong, Longguang, Wan, Fanqi, Chen, Ruijun, Quan, Xiaojun, Li, Liangzhi

Jun-20-2024–arXiv.org Artificial Intelligence

With the rapid growth in the size and complexity of large language models (LLMs), the costs associated with their training and inference have escalated significantly. Research indicates that certain layers in LLMs harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance. While various layer pruning methods have been developed based on this insight, they generally overlook the finer-grained redundancies within the layers themselves. In this paper, we delve deeper into the architecture of LLMs and demonstrate that finer-grained pruning can be achieved by targeting redundancies in multi-head attention (MHA) and multi-layer perceptron (MLP) blocks. We propose a novel, training-free structured pruning approach called BlockPruner. Unlike existing layer pruning methods, BlockPruner segments each Transformer layer into MHA and MLP blocks.

Figure 1: Block Influence (BI) scores (Men et al., 2024) for the Llama2-7B model (Touvron et al., 2023b) computed at both layer and block levels, where blocks/layers with lower BI scores indicate less importance. The model has 32 Transformer layers, each containing one MHA and one MLP block, totaling 64 blocks. Block-level BI scores are generally lower than layer-level scores, indicating finer-grained redundancies.
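The Block Influence score mentioned in the Figure 1 caption can be illustrated with a small sketch. The text above does not define BI, so the code assumes the definition commonly attributed to Men et al. (2024): one minus the average cosine similarity between a token's hidden state before and after a block. The function names (`cosine_similarity`, `block_influence`, `enumerate_blocks`) and the toy vectors are hypothetical and are not taken from the paper's code.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def block_influence(inputs: Sequence[Vector], outputs: Sequence[Vector]) -> float:
    """Assumed BI definition: 1 - mean cosine similarity over tokens.

    inputs[t] is the hidden state of token t entering a block (or layer),
    outputs[t] is the hidden state of the same token leaving it.
    A score near 0 means the block barely changes the direction of the
    hidden states, which is read as low importance.
    """
    if len(inputs) != len(outputs) or not inputs:
        raise ValueError("inputs and outputs must be non-empty and aligned")
    sims = [cosine_similarity(h_in, h_out) for h_in, h_out in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)


def enumerate_blocks(num_layers: int) -> List[Tuple[int, str]]:
    """List (layer index, block kind) pairs: one MHA and one MLP per layer."""
    return [(layer, kind) for layer in range(num_layers) for kind in ("MHA", "MLP")]


# Toy hidden states for two tokens (illustrative values only).
states_in = [[1.0, 0.0], [0.0, 2.0]]
identity_out = [[1.0, 0.0], [0.0, 2.0]]   # block leaves the states unchanged
rotated_out = [[0.0, 1.0], [-2.0, 0.0]]   # block rotates each state by 90 degrees

bi_identity = block_influence(states_in, identity_out)
bi_rotated = block_influence(states_in, rotated_out)
blocks = enumerate_blocks(32)
```

Under this assumed definition, a block that returns its input unchanged scores 0 and one that maps every hidden state to an orthogonal direction scores 1. Enumerating blocks for a 32-layer model yields the 64 blocks cited in the caption.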

blockpruner, pruning, pruning ratio

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.69)
    - Perceptrons (0.55)
