NIRVANA: Structured pruning reimagined for large language models compression

Mengting Ai, Tianxin Wei, Sirui Chen, Jingrui He

arXiv.org Artificial Intelligence 

Transformer-based (Vaswani et al., 2017) large language models (LLMs) have revolutionized natural language processing, but their sheer scale makes them costly to deploy. To alleviate this critical bottleneck, model compression techniques--particularly pruning (LeCun et al., 1989)--have emerged as an essential strategy, aiming to create lighter, more accessible models.

Recent unstructured pruning methods, such as SparseGPT (Frantar and Alistarh, 2023) and Wanda (Sun et al., 2023), prune individual weights, yielding irregular sparsity that is difficult to accelerate on commodity hardware; these two can also be applied for semi-structured pruning. Semi-structured methods address this irregularity by imposing fixed patterns (e.g., 2:4 sparsity (Fang et al., 2024; Zheng et al., 2024)), yet still struggle to support efficient training and require specialized hardware. ShortGPT (Men et al., 2024) introduces global or layer-wise pruning strategies, yet does not explicitly account for how pruning interacts with subsequent fine-tuning, and SliceGPT (Ashkboos et al., 2024) applies PCA-based transformations per block but remains highly sensitive to calibration data. Moreover, these methods overlook disparities in pruning sensitivity across layers and modules; this oversight often results in suboptimal pruning choices, impairing model performance.

To address these critical gaps, we introduce NIRVANA (NTK-InfoRmed adaptiVe neuron & AttentioN heAd pruning), a novel structured pruning method explicitly designed to balance immediate zero-shot accuracy preservation with robust fine-tuning capability. NIRVANA tightly integrates pruning decisions with model fine-tuning dynamics through the lens of the Neural Tangent Kernel (NTK) (Jacot et al., 2018), and it employs an adaptive sparsity allocation strategy that dynamically adjusts pruning ratios across layers and modules, explicitly addressing the disparities overlooked by existing pruning methodologies.

The NTK provides a kernel-based framework for analyzing the training dynamics of neural networks; consequently, popular practices include fixing the weights (i.e., evaluating the kernel at the pre-trained parameters). Without loss of generality, the analysis can be extended to the vector-output case; see Section A.6 of the paper for the details of the derivation. Since most current LLMs are based on the SwiGLU (Shazeer, 2020) structure, the method focuses on that architecture. In Llama3's implementation, which employs Grouped Query Attention (GQA), multiple query heads share the same key and value heads.
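To make that sharing concrete: in a GQA layout, each key/value head serves a fixed group of query heads, so one natural bookkeeping constraint for head pruning is that a KV head becomes removable only once its entire query group is pruned. Below is a minimal sketch under an assumed Llama3-8B-like layout (32 query heads, 8 KV heads); the grouping arithmetic is standard GQA, while the pruning decision shown is purely illustrative and not NIRVANA's actual scoring.

```python
import torch

# Assumed Llama3-8B-like head layout: 32 query heads, 8 key/value heads,
# so each KV head is shared by a group of 32 // 8 = 4 query heads.
n_q_heads, n_kv_heads = 32, 8
group_size = n_q_heads // n_kv_heads

# KV head serving each query head: [0,0,0,0, 1,1,1,1, ..., 7,7,7,7]
kv_of_q = torch.arange(n_q_heads) // group_size

# Illustrative mask of pruned query heads (heads 4..7 = query group 1).
pruned_q = torch.zeros(n_q_heads, dtype=torch.bool)
pruned_q[4:8] = True

# A KV head's projections can be dropped only if every query head
# in its group is pruned along with it.
removable_kv = torch.tensor(
    [bool(pruned_q[kv_of_q == k].all()) for k in range(n_kv_heads)]
)
print(removable_kv)  # True only at index 1
```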
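Structured neuron pruning in a SwiGLU block removes matched slices: dropping hidden neuron i deletes row i of the gate and up projections and column i of the down projection, so the weight matrices genuinely shrink. A minimal PyTorch sketch follows; the random keep mask is a placeholder for whatever importance ranking is used (NIRVANA instead ranks neurons with its NTK-informed score).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Llama-style feed-forward block: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def prune_neurons(ffn: SwiGLU, keep: torch.Tensor) -> SwiGLU:
    """Keep only hidden neurons where `keep` is True: rows of gate/up
    and the matching columns of down are removed together."""
    pruned = SwiGLU(ffn.gate.in_features, int(keep.sum()))
    with torch.no_grad():
        pruned.gate.weight.copy_(ffn.gate.weight[keep])
        pruned.up.weight.copy_(ffn.up.weight[keep])
        pruned.down.weight.copy_(ffn.down.weight[:, keep])
    return pruned

ffn = SwiGLU(d_model=16, d_hidden=64)
keep = torch.rand(64) > 0.25          # placeholder mask, not a real score
small = prune_neurons(ffn, keep)
out = small(torch.randn(2, 16))       # output dimension is unchanged
```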
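For a scalar-output network f with parameters held fixed, the empirical NTK on a calibration set is the Gram matrix K_ij = <grad f(x_i), grad f(x_j)>. The sketch below computes it for a hypothetical toy model; NIRVANA's actual saliency score is derived from this kernel in the paper (Section A.6), so this only illustrates the underlying object.

```python
import torch
import torch.nn as nn

# Hypothetical toy scalar-output model; the empirical NTK is defined the
# same way for any differentiable network with fixed parameters.
model = nn.Sequential(nn.Linear(10, 32), nn.SiLU(), nn.Linear(32, 1))
params = list(model.parameters())

def grad_vector(x: torch.Tensor) -> torch.Tensor:
    """Flattened gradient of the scalar output f(x) w.r.t. all parameters."""
    out = model(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])

xs = torch.randn(4, 10)                        # tiny calibration batch
J = torch.stack([grad_vector(x) for x in xs])  # (n_samples, n_params)
ntk = J @ J.T                                  # K_ij = <grad f(x_i), grad f(x_j)>
print(ntk)
```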
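For contrast with structured pruning, the 2:4 semi-structured pattern cited above zeroes exactly two weights in every contiguous group of four along the input dimension. A minimal magnitude-based sketch is below; the selection criterion is a generic placeholder rather than the scores used by the cited methods.

```python
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in each group of 4 along
    the input dimension (generic criterion, for illustration only)."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "2:4 sparsity needs in_features divisible by 4"
    groups = weight.abs().reshape(rows, cols // 4, 4)
    keep_idx = groups.topk(k=2, dim=-1).indices      # 2 largest per group
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep_idx, True)
    return weight * mask.reshape(rows, cols)

w = torch.randn(8, 16)
w_24 = prune_2_of_4(w)
assert (w_24.reshape(8, -1, 4) != 0).sum(-1).max() <= 2
```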