Computation and Memory-Efficient Model Compression with Gradient Reweighting

Jun-11-2026, 23:15:44 GMT–Neural Information Processing Systems

Pruning is a commonly employed technique for deep neural networks (DNNs) aiming at compressing the model size to reduce computational and memory costs during inference. In contrast to conventional neural networks, large language models (LLMs) pose a unique challenge regarding pruning efficiency due to their substantial computational and memory demands. Existing methods, particularly optimization-based ones, often require considerable computational resources in gradient estimation because they cannot effectively leverage weight sparsity of the intermediate pruned network to lower compuation and memory costs in each iteration. The fundamental challenge lies in the need to frequently instantiate intermediate pruned sub-models to achieve these savings, a task that becomes infeasible even for moderately sized neural networks. To this end, this paper proposes a novel pruning method for DNNs that is both computationally and memory-efficient.

large language model, machine learning, natural language, (10 more...)

Neural Information Processing Systems

Jun-11-2026, 23:15:44 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.95)
  - Machine Learning > Neural Networks
    - Deep Learning (0.38)