AITopics

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry:

Information Technology (0.67)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems | Feb-11-2026, 03:28:44 GMT

LLM-Pruner: On the Structural Pruning of Large Language Models

Xinyin Ma, Gongfan Fang, Xinchao Wang (National University of Singapore)

Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

Country:

Asia > Singapore (0.40)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Texas > Bexar County > San Antonio (0.04)
(2 more...)

Genre: Research Report > New Finding (0.93)

Industry: Consumer Products & Services > Restaurants (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing Systems | Dec-24-2025, 22:27:57 GMT

LLM-Pruner: On the Structural Pruning of Large Language Models

Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such capability typically comes with a substantial model size, which presents significant challenges in the deployment, inference, and training stages. Because an LLM is a general-purpose task solver, we explore its compression in a task-agnostic manner, aiming to preserve the multi-task solving and language generation ability of the original LLM. One challenge is the enormous size of the LLM's training corpus, which makes both data transfer and model post-training over-burdensome. Thus, we tackle the compression of LLMs under two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures based on gradient information, preserving the majority of the LLM's functionality. The performance of pruned models can then be efficiently recovered through a tuning technique, LoRA, in merely 3 hours, requiring only 50K data. We validate LLM-Pruner on three LLMs, LLaMA, Vicuna, and ChatGLM, and demonstrate that the compressed models still exhibit satisfactory capabilities in zero-shot classification and generation. The code will be made public.
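
The abstract above describes removing "coupled structures" ranked by gradient information. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: it assumes a two-matrix MLP where hidden channel j couples row j of `w_up` with column j of `w_down`, and it uses a first-order Taylor score |w * gradient| as the importance measure. The function names, the shapes, and the choice of score are all assumptions made for illustration.

```python
import numpy as np

def coupled_channel_scores(w_up, g_up, w_down, g_down):
    """First-order Taylor score |w * dL/dw|, summed over each coupled group.

    Assumed layout: w_up is (hidden, d_model), w_down is (d_model, hidden).
    Group j couples row j of w_up with column j of w_down, because removing
    hidden channel j removes both at once.
    """
    return np.abs(w_up * g_up).sum(axis=1) + np.abs(w_down * g_down).sum(axis=0)

def prune_coupled_channels(w_up, g_up, w_down, g_down, ratio):
    """Drop the lowest-scoring fraction `ratio` of hidden channels."""
    hidden = w_up.shape[0]
    n_keep = max(1, hidden - int(round(ratio * hidden)))
    scores = coupled_channel_scores(w_up, g_up, w_down, g_down)
    keep = np.sort(np.argsort(scores)[-n_keep:])  # kept indices, original order
    # Slicing both matrices with the same index set keeps the group consistent.
    return w_up[keep, :], w_down[:, keep], keep

# Toy example with random weights and random stand-in gradients.
rng = np.random.default_rng(0)
w_up, w_down = rng.normal(size=(8, 4)), rng.normal(size=(4, 8))
g_up, g_down = rng.normal(size=(8, 4)), rng.normal(size=(4, 8))
small_up, small_down, kept = prune_coupled_channels(w_up, g_up, w_down, g_down, 0.25)
print(small_up.shape, small_down.shape)  # (6, 4) (4, 6)
```

Because whole channels are removed, the result is a pair of smaller dense matrices rather than a sparse mask, which is what distinguishes structural pruning from unstructured weight pruning.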

llm-pruner, name change, structural pruning, (3 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing Systems | Oct-10-2025, 22:14:22 GMT

Search for Efficient Large Language Models

Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Numerous efficient techniques, including weight pruning, quantization, and distillation, have been embraced to compress LLMs, targeting memory reduction and inference acceleration, which underscore the redundancy in LLMs. However, most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures. Besides, traditional architecture search methods, limited by the elevated complexity with extensive parameters, struggle to demonstrate their effectiveness on LLMs. In this paper, we propose a training-free architecture search framework to identify optimal subnets that preserve the fundamental strengths of the original LLMs while achieving inference acceleration. Furthermore, after generating subnets that inherit specific weights from the original LLMs, we introduce a reformation algorithm that utilizes the omitted weights to rectify the inherited weights with a small amount of calibration data. Compared with SOTA training-free structured pruning works that can generate smaller networks, our method demonstrates superior performance across standard benchmarks. Furthermore, our generated subnets can directly reduce the usage of GPU memory and achieve inference acceleration.

architecture, dataset, subnet, (14 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry:

Information Technology (0.67)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems | Oct-8-2025, 14:06:11 GMT

44956951349095f74492a5471128a7e0-Paper-Conference.pdf

arxiv preprint arxiv, large language model, machine learning, (18 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore (0.04)
North America > United States > Texas > Bexar County > San Antonio (0.04)
(2 more...)

Genre: Research Report > New Finding (0.93)

Industry: Consumer Products & Services > Restaurants (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

arXiv.org Artificial Intelligence | Sep-18-2025

NIRVANA: Structured pruning reimagined for large language models compression

Ai, Mengting, Wei, Tianxin, Chen, Sirui, He, Jingrui

To address these critical shortcomings, we introduce NIRVANA, a novel pruning method explicitly designed to balance immediate zero-shot accuracy preservation with robust fine-tuning capability. [...] Transformer-based (Vaswani et al., 2017) large language models (LLMs) have revolutionized natural [...] To alleviate this critical bottleneck, model compression techniques, particularly pruning (LeCun et al., 1989), emerge as an essential strategy, aiming to create lighter, more accessible models [...] These two can also be applied for semi-structured pruning. [...] This oversight often results in suboptimal pruning choices, impairing model performance. To address these critical gaps, we introduce NIRVANA (NTK-InfoRmed adaptiVe neuron & AttentioN heAd pruning), a novel structured pruning method that tightly integrates pruning decisions with model fine-tuning dynamics through the lens of the Neural Tangent Kernel (NTK) (Jacot et al., 2018). [...] An adaptive sparsity allocation strategy that dynamically adjusts pruning ratios across layers and modules, explicitly addressing overlooked disparities in existing pruning methodologies. [...] Recent unstructured pruning methods, such as SparseGPT (Frantar and Alistarh, 2023) and Wanda (Sun et al., 2023), prune individual weights [...] Semi-structured methods address this by imposing fixed patterns (e.g., 2:4 sparsity (Fang et al., 2024; Zheng et al., 2024)), yet still struggle to support efficient training and require specialized hardware. [...] ShortGPT (Men et al., 2024) introduce global or layer-wise pruning strategies, yet do not explicitly [...] SliceGPT (Ashkboos et al., 2024) applies PCA-based transformations per block, but remains highly sensitive to calibration data, reflecting a broader [...]
Since most of the current LLMs are based on the SwiGLU (Shazeer, 2020) structure, we focus [...] Neural Tangent Kernel (NTK) (Jacot et al., 2018) provides a kernel-based framework for analyzing [...] See the details of the derivation in Section A.6. [...] Consequently, popular practices include fixing the weights (i.e., setting [...] In Llama3's implementation, which employs Grouped Query Attention (GQA), multiple query heads share [...] Without loss of generality, our analysis can be extended to the vector-output case.
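
The snippet above mentions semi-structured methods that impose fixed patterns such as 2:4 sparsity. As an illustration of what that pattern means, here is a small NumPy sketch that keeps 2 of every 4 consecutive weights in each row. Selecting by weight magnitude is an assumption chosen for simplicity, and `two_four_mask` is a hypothetical helper name; the cited methods may use different selection criteria.

```python
import numpy as np

def two_four_mask(w):
    """Boolean mask keeping the 2 largest-magnitude weights in each group of 4.

    w: 2-D weight matrix whose column count is a multiple of 4.
    Groups are formed from consecutive entries along each row.
    """
    rows, cols = w.shape
    if cols % 4 != 0:
        raise ValueError("number of columns must be a multiple of 4")
    groups = np.abs(w).reshape(rows, cols // 4, 4)
    # Indices of the two smallest magnitudes inside every group of four.
    drop = np.argsort(groups, axis=-1)[..., :2]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return mask.reshape(rows, cols)

w = np.array([[1.0, -5.0, 0.1, 3.0, 2.0, 2.5, -0.2, 0.3]])
mask = two_four_mask(w)
sparse_w = w * mask              # pruned weights stay in place as zeros
print(mask.sum(axis=1))          # [4]
```

The matrix keeps its original shape and exactly half of its entries become zero, so any speed-up depends on hardware that can exploit the pattern, which is the limitation the snippet points to.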

large language model, machine learning, pruning, (16 more...)

2509.1423

Country:

North America > United States (1.00)
Asia > Japan (0.97)

Genre: Research Report > New Finding (0.93)

Industry: Government > Regional Government > Asia Government > Japan Government (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence | Mar-8-2025

Sample-aware Adaptive Structured Pruning for Large Language Models

Kong, Jun, Ma, Xinge, Wang, Jin, Zhang, Xuejie

Large language models (LLMs) have achieved outstanding performance in natural language processing, but enormous model sizes and high computational costs limit their practical deployment. Structured pruning can effectively reduce the resource demands for deployment by removing redundant model parameters. However, the randomly selected calibration data and fixed single importance estimation metrics in existing structured pruning methods lead to degraded performance of pruned models. This study introduces AdaPruner, a sample-aware adaptive structured pruning framework for LLMs, aiming to optimize the calibration data and importance estimation metrics in the structured pruning process. Specifically, AdaPruner effectively removes redundant parameters from LLMs by constructing a structured pruning solution space and then employing Bayesian optimization to adaptively search for the optimal calibration data and importance estimation metrics. Experimental results show that AdaPruner outperforms existing structured pruning methods on a family of LLMs with varying pruning ratios, demonstrating its applicability and robustness. Remarkably, at a 20% pruning ratio, the model pruned with AdaPruner maintains 97% of the performance of the unpruned model.
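
The abstract above describes a solution space made of calibration data and importance metrics that is then searched. The sketch below shows only the shape of such a search loop. It substitutes plain random search for the Bayesian optimization named in the abstract, and every name in it (`search_pruning_config`, the subset and metric labels, the toy loss table) is hypothetical. The loss values are made-up placeholders, not results.

```python
import itertools
import random

def search_pruning_config(calib_subsets, metrics, evaluate, n_trials, seed=0):
    """Random search over (calibration subset, importance metric) pairs.

    evaluate(subset, metric) is assumed to prune a model with that
    configuration and return a loss where lower is better.
    """
    space = list(itertools.product(calib_subsets, metrics))
    rng = random.Random(seed)
    # Sample without replacement so no configuration is evaluated twice.
    trials = rng.sample(space, k=min(n_trials, len(space)))
    results = {cfg: evaluate(*cfg) for cfg in trials}
    best = min(results, key=results.get)
    return best, results[best]

# Placeholder losses standing in for an expensive prune-and-evaluate step.
toy_loss = {
    ("subset_a", "magnitude"): 3.0, ("subset_a", "taylor"): 2.0,
    ("subset_b", "magnitude"): 2.5, ("subset_b", "taylor"): 1.5,
}
best_cfg, best_loss = search_pruning_config(
    ["subset_a", "subset_b"], ["magnitude", "taylor"],
    lambda subset, metric: toy_loss[(subset, metric)], n_trials=4)
print(best_cfg, best_loss)  # ('subset_b', 'taylor') 1.5
```

A Bayesian optimizer would replace the uniform sampling step with a model that proposes the next configuration from past evaluations, which matters when each evaluation is costly.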

calibration data, importance estimation, pruning ratio, (15 more...)

2503.06184

Country: Asia > China > Yunnan Province > Kunming (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing Systems | Jan-14-2025, 22:23:23 GMT

LLM-Pruner: On the Structural Pruning of Large Language Models

Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such capability typically comes with a substantial model size, which presents significant challenges in the deployment, inference, and training stages. Because an LLM is a general-purpose task solver, we explore its compression in a task-agnostic manner, aiming to preserve the multi-task solving and language generation ability of the original LLM. One challenge is the enormous size of the LLM's training corpus, which makes both data transfer and model post-training over-burdensome. Thus, we tackle the compression of LLMs under two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures based on gradient information, preserving the majority of the LLM's functionality.

language model, llm-pruner, structural pruning, (1 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial Intelligence | Dec-16-2024

Numerical Pruning for Efficient Autoregressive Models

Shen, Xuan, Song, Zhao, Zhou, Yufa, Chen, Bo, Liu, Jing, Zhang, Ruiyi, Rossi, Ryan A., Tan, Hao, Yu, Tong, Chen, Xiang, Zhou, Yufan, Sun, Tong, Zhao, Pu, Wang, Yanzhi, Gu, Jiuxiang

Transformers have emerged as the leading architecture in deep learning, proving to be versatile and highly effective across diverse domains beyond language and image processing. However, their impressive performance often incurs high computational costs due to their substantial model size. This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning to improve the model efficiency while preserving performance for both language and image generation tasks. Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and MLP modules, respectively. Besides, we further propose another compensation algorithm to recover the pruned model for better performance. To verify the effectiveness of our method, we provide both theoretical support and extensive experiments. Our experiments show that our method achieves state-of-the-art performance with reduced memory usage and faster generation speeds on GPUs.

large language model, machine learning, natural language, (18 more...)

2412.12441

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Tennessee (0.04)
North America > United States > Pennsylvania (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

arXiv.org Artificial Intelligence | Dec-9-2024

LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation

Wu, Haihang

Large language models (LLMs) have demonstrated remarkable performance across various language tasks, but their widespread deployment is impeded by their large size and high computational costs. Structural pruning is a prevailing technique used to introduce sparsity into pre-trained models and facilitate direct hardware acceleration during inference by removing redundant connections (structurally-grouped parameters), such as channels and attention heads. Existing structural pruning approaches often employ either global or layer-wise pruning criteria; however, both are hindered by inaccurate evaluation of connection importance. Global pruning methods typically assess component importance using near-zero and unreliable gradients, while layer-wise pruning approaches encounter significant pruning error accumulation issues. To this end, we propose a more accurate pruning metric based on block-wise importance score propagation, termed LLM-BIP. Specifically, LLM-BIP precisely evaluates connection importance by gauging its influence on the respective transformer block output, which can be efficiently approximated in a single forward pass through an upper bound derived from the assumption of Lipschitz continuity. We evaluate the proposed method using LLaMA-7B, Vicuna-7B, and LLaMA-13B across common zero-shot tasks. The results demonstrate that our approach achieves an average 3.26% increase in accuracy on common reasoning tasks compared to previous best baselines. It also reduces perplexity by 14.09 and 68.76 on average for the WikiText2 dataset and PTB dataset, respectively.
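
The abstract above scores a connection by an upper bound on its influence on the block output, derived from Lipschitz continuity and computed in one forward pass. The sketch below illustrates that style of bound on a single linear map followed by a 1-Lipschitz activation. It is a generic textbook bound written for illustration, not the paper's block-wise metric, and `channel_importance_bound` is a hypothetical name.

```python
import numpy as np

def channel_importance_bound(w, x, lipschitz=1.0):
    """Upper bound on the mean output change from zeroing each input channel.

    w: (out_features, in_features) weight of a linear map.
    x: (n_samples, in_features) activations collected in one forward pass.
    For y = f(x @ w.T) with f being `lipschitz`-Lipschitz, zeroing channel i
    changes y by at most lipschitz * ||w[:, i]||_2 * |x_i| for each sample.
    """
    col_norms = np.linalg.norm(w, axis=0)
    return lipschitz * col_norms * np.abs(x).mean(axis=0)

rng = np.random.default_rng(0)
w = rng.normal(size=(5, 3))
x = rng.normal(size=(16, 3))
bound = channel_importance_bound(w, x)

def relu(z):
    return np.maximum(z, 0.0)  # ReLU is 1-Lipschitz

# Compare the bound with the measured effect of removing channel 0.
x_drop = x.copy()
x_drop[:, 0] = 0.0
actual = np.linalg.norm(relu(x @ w.T) - relu(x_drop @ w.T), axis=1).mean()
print(actual <= bound[0] + 1e-12)  # True
```

The bound needs only the weights and the observed activations, with no gradients, which is the practical appeal of forward-only importance scores over the gradient-based scores the abstract criticizes.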

large language model, machine learning, pruning, (19 more...)