AITopics | unstructured pruning

unstructured pruning

MUSTAFAR: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference

Neural Information Processing Systems, Jun-18-2026, 07:22:33 GMT

We demonstrate that unstructured sparsity significantly improves KV cache compression for LLMs, enabling sparsity levels up to 70% without compromising accuracy or requiring fine-tuning. We conduct a systematic exploration of pruning strategies and find per-token magnitude-based pruning to be highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes. The Key cache benefits from prominent outlier elements, while the Value cache surprisingly benefits from simple magnitude-based pruning despite its uniform distribution. KV cache size is the major bottleneck in decode performance because of its high memory overhead at large context lengths. To address this, we use a bitmap-based sparse format and a custom attention kernel capable of compressing and directly computing over compressed caches pruned to arbitrary sparsity patterns, significantly accelerating memory-bound operations in decode computations and thereby compensating for the overhead of runtime pruning and compression. Our custom attention kernel coupled with the bitmap-based format delivers substantial compression of the KV cache, up to 45% of dense inference, and thereby enables longer context lengths and increased tokens/sec throughput of up to 2.23x compared to dense inference.
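As a rough illustration of the per-token magnitude-based pruning mentioned in the abstract, the sketch below zeroes the smallest-magnitude entries of each token's vector independently until a target sparsity is reached. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names `prune_token` and `prune_cache` and the toy cache values are hypothetical, and a real KV cache would be a GPU tensor rather than nested lists.

```python
def prune_token(vec, sparsity):
    """Zero the smallest-magnitude entries of one token's vector."""
    n_prune = round(len(vec) * sparsity)
    # Indices ordered by ascending magnitude; the first n_prune are dropped.
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else v for i, v in enumerate(vec)]


def prune_cache(cache, sparsity=0.7):
    """Apply the same sparsity target independently to every token (row)."""
    return [prune_token(tok, sparsity) for tok in cache]


# Toy cache: 2 tokens x 10 channels (hypothetical values).
key_cache = [
    [0.1, -4.0, 0.3, 0.2, -0.05, 2.5, 0.0, 0.4, -0.1, 0.6],
    [1.2, 0.1, -0.2, 3.3, 0.05, -0.3, 0.7, 0.0, 0.2, -0.9],
]
pruned = prune_cache(key_cache, sparsity=0.7)
for row in pruned:
    print(row)
```

Because the threshold is chosen per token, each row keeps its own largest entries (here 3 of 10), and the surviving positions differ from row to row. That irregular pattern is what makes the sparsity unstructured.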

large language model, machine learning, pruning, (22 more...)

Neural Information Processing Systems

Country:

North America > United States (0.67)
Europe (0.46)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

Experimental Results of Pruning Plasticity

Neural Information Processing Systems, Apr-25-2026, 22:00:54 GMT

We also studied pruning plasticity under structured pruning. In particular, we choose the filter pruning method used in Li et al. [32]. The pruning criterion is the absolute weight sum of each nonzero filter, and the regeneration criterion is the absolute gradient sum of each zero filter. We first pre-train four sets of neural networks from scratch at different structured sparsity levels (0, 0.10, 0.50, and 0.70), noted as "Pre-trained Sparsity" in the figure title. To measure the plasticity of these pre-trained models, we choose four different pruning rates, noted as "Pruning rate", to remove filters from these pre-trained models.
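The two criteria described above can be sketched as follows. This is a minimal pure-Python illustration, assuming each filter is flattened into a list of weights and that a "zero filter" is one whose weights are all zero. The helper names `filter_scores`, `prune_filters`, and `regrow_candidates` and the toy numbers are hypothetical and are not taken from the paper's code.

```python
def filter_scores(filters):
    """Absolute sum of the entries of each filter (weights or gradients)."""
    return [sum(abs(x) for x in f) for f in filters]


def prune_filters(weights, n_prune):
    """Zero the n_prune nonzero filters with the smallest absolute weight sum."""
    scores = filter_scores(weights)
    alive = [i for i, s in enumerate(scores) if s > 0]
    victims = set(sorted(alive, key=lambda i: scores[i])[:n_prune])
    return [[0.0] * len(f) if i in victims else list(f)
            for i, f in enumerate(weights)]


def regrow_candidates(weights, grads, n_regrow):
    """Indices of the n_regrow zero filters with the largest absolute gradient sum."""
    w_scores = filter_scores(weights)
    g_scores = filter_scores(grads)
    dead = [i for i, s in enumerate(w_scores) if s == 0]
    return sorted(dead, key=lambda i: g_scores[i], reverse=True)[:n_regrow]


# Five toy filters with two weights each (hypothetical values).
weights = [[0.5, -0.5], [0.1, 0.0], [0.0, 0.0], [-2.0, 1.0], [0.0, 0.0]]
grads = [[0.1, 0.1], [0.2, 0.2], [0.3, -0.4], [0.0, 0.1], [-1.0, 0.5]]

pruned = prune_filters(weights, n_prune=1)
regrow = regrow_candidates(pruned, grads, n_regrow=2)
print(pruned)
print(regrow)
```

In this toy case the weakest live filter (index 1) is removed, and the zero filters at indices 4 and 2 are the first to be regenerated because they receive the largest gradients.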

artificial intelligence, machine learning, sparsity, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

5227b6aaf294f5f027273aebf16015f2-Supplemental.pdf

Neural Information Processing Systems, Feb-19-2026, 02:42:07 GMT

artificial intelligence, granet, machine learning, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.97)

ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

Neural Information Processing Systems, Feb-11-2026, 22:56:31 GMT

One-shot pruning techniques offer a way to alleviate these burdens by removing redundant weights without the need for retraining. Yet, the massive scale of LLMs often forces current pruning approaches to rely on heuristics instead of optimization-based techniques, potentially resulting in suboptimal compression.

large language model, machine learning, pruning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Italy (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Shaving Weights with Occam's Razor: Bayesian Sparsification for Neural Networks using the Marginal Likelihood

Neural Information Processing Systems, Feb-10-2026, 06:43:17 GMT

While much work has focused on different weight pruning criteria, the overall sparsifiability of the network, i.e., its capacity to be pruned without quality loss, has often been overlooked.

machine learning, natural language, pruning, (19 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Wisconsin (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.95)

Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference

Joo, Donghyeon, Hosseini, Helya, Hadidi, Ramyad, Asgari, Bahar

arXiv.org Artificial Intelligence, Nov-7-2025

We demonstrate that unstructured sparsity significantly improves KV cache compression for LLMs, enabling sparsity levels up to 70% without compromising accuracy or requiring fine-tuning. We conduct a systematic exploration of pruning strategies and find per-token magnitude-based pruning to be highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes. The Key cache benefits from prominent outlier elements, while the Value cache surprisingly benefits from simple magnitude-based pruning despite its uniform distribution. KV cache size is the major bottleneck in decode performance because of its high memory overhead at large context lengths. To address this, we use a bitmap-based sparse format and a custom attention kernel capable of compressing and directly computing over compressed caches pruned to arbitrary sparsity patterns, significantly accelerating memory-bound operations in decode computations and thereby compensating for the overhead of runtime pruning and compression. Our custom attention kernel coupled with the bitmap-based format delivers substantial compression of the KV cache, up to 45% of dense inference, and thereby enables longer context lengths and increased tokens/sec throughput of up to 2.23x compared to dense inference. Our pruning mechanism and sparse attention kernel are available at https://github.com/dhjoo98/mustafar.
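To make the idea of a bitmap-based sparse format concrete, the sketch below stores one pruned vector as a bitmap (one bit per position) plus a packed list of its nonzero values, then computes a dot product directly on the compressed form. It is a generic pure-Python illustration under stated assumptions, not the layout or kernel from the linked repository: the names `compress`, `decompress`, and `sparse_dot` are hypothetical, and a real kernel would operate on fixed-size tiles on the GPU.

```python
def compress(vec):
    """Return (bitmap, values): bit i of bitmap is set iff vec[i] is nonzero."""
    bitmap = 0
    values = []
    for i, v in enumerate(vec):
        if v != 0:
            bitmap |= 1 << i
            values.append(v)
    return bitmap, values


def decompress(bitmap, values, length):
    """Rebuild the dense vector from the bitmap and the packed values."""
    packed = iter(values)
    return [next(packed) if (bitmap >> i) & 1 else 0.0 for i in range(length)]


def sparse_dot(bitmap, values, query):
    """Dot product with a dense query, reading only the stored nonzeros."""
    total, k = 0.0, 0
    for i, q in enumerate(query):
        if (bitmap >> i) & 1:
            total += values[k] * q
            k += 1
    return total


# One pruned key vector at 70% sparsity (hypothetical values).
pruned_key = [0.0, -4.0, 0.0, 0.0, 0.0, 2.5, 0.0, 0.0, 0.0, 0.5]
bitmap, values = compress(pruned_key)
query = [2.0] * len(pruned_key)
score = sparse_dot(bitmap, values, query)
print(bin(bitmap), values, score)
```

The bitmap imposes no constraint on where the nonzeros sit, which is why a format of this kind can represent arbitrary sparsity patterns. Its cost is one extra bit per element on top of the stored values.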

large language model, machine learning, pruning, (21 more...)

arXiv.org Artificial Intelligence

2505.22913

Country:

North America > United States (0.46)
Asia (0.28)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

Neural Information Processing Systems, Oct-10-2025, 00:36:31 GMT

arxiv preprint arxiv, pruning, sparsity level, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Italy (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

2c572cad9ae98c5cb6f3fca040b2bc54-Paper-Conference.pdf

Neural Information Processing Systems, Oct-9-2025, 21:58:16 GMT

approximation, pruning, sparsity, (13 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Wisconsin (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
(4 more...)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

SInGE: Sparsity via Integrated Gradients Estimation of Neuron Relevance

Neural Information Processing Systems, Aug-19-2025, 14:33:23 GMT

However, it often comes at a computational price, which may hinder their deployment.

artificial intelligence, machine learning, pruning, (18 more...)

Neural Information Processing Systems

Country: Europe > France > Île-de-France > Paris > Paris (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Sensing and Signal Processing > Image Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Mosaic: Composite Projection Pruning for Resource-efficient LLMs

Eccles, Bailey J., Wong, Leon, Varghese, Blesson

arXiv.org Artificial Intelligence, Aug-14-2025

Extensive compute and memory requirements limit the deployment of large language models (LLMs) on any hardware. Compression methods, such as pruning, can reduce model size, which in turn reduces resource requirements. State-of-the-art pruning is based on coarse-grained methods. They are time-consuming and inherently remove critical model parameters, adversely impacting the quality of the pruned model. This paper introduces projection pruning, a novel fine-grained method for pruning LLMs. In addition, LLM projection pruning is enhanced by a new approach we refer to as composite projection pruning: the synergistic combination of unstructured pruning, which retains accuracy, and structured pruning, which reduces model size. We develop Mosaic, a novel system to create and deploy pruned LLMs using composite projection pruning. Mosaic is evaluated using a range of performance and quality metrics on multiple hardware platforms, LLMs, and datasets. Mosaic is 7.19x faster in producing models than existing approaches. Mosaic models achieve up to 84.2% lower perplexity and 31.4% higher accuracy than models obtained from coarse-grained pruning. Mosaic models also show up to 67% faster inference and 68% lower GPU memory use. Mosaic is available for public use from https://github.com/blessonvar/Mosaic
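The difference between the two kinds of pruning that the abstract combines can be illustrated on a single weight matrix. The sketch below is a generic pure-Python example, not Mosaic's algorithm. It assumes structured pruning means dropping whole rows ranked by absolute sum and unstructured pruning means zeroing individual weights ranked by magnitude. The function names and the toy matrix are hypothetical.

```python
def structured_prune(matrix, n_rows_to_drop):
    """Remove whole rows with the smallest absolute sum; the matrix shrinks."""
    scores = [sum(abs(w) for w in row) for row in matrix]
    order = sorted(range(len(matrix)), key=lambda i: scores[i])
    drop = set(order[:n_rows_to_drop])
    return [list(row) for i, row in enumerate(matrix) if i not in drop]


def unstructured_prune(matrix, sparsity):
    """Zero the smallest-magnitude individual weights; the shape is unchanged."""
    cols = len(matrix[0])
    flat = [w for row in matrix for w in row]
    n_prune = round(len(flat) * sparsity)
    order = sorted(range(len(flat)), key=lambda i: abs(flat[i]))
    drop = set(order[:n_prune])
    kept = [0.0 if i in drop else w for i, w in enumerate(flat)]
    return [kept[r * cols:(r + 1) * cols] for r in range(len(matrix))]


# Toy 4x3 projection matrix (hypothetical values).
W = [
    [0.9, -0.1, 0.4],
    [0.05, 0.02, -0.01],
    [-0.7, 0.3, 0.2],
    [0.6, -0.8, 0.1],
]
smaller = structured_prune(W, n_rows_to_drop=1)          # 4x3 -> 3x3
composite = unstructured_prune(smaller, sparsity=0.33)   # 3 of 9 weights zeroed
for row in composite:
    print(row)
```

The structured step changes the matrix shape, so it reduces size without any sparse storage format. The unstructured step keeps the shape and only pays off in memory or speed when the zeros are stored or skipped explicitly.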

large language model, machine learning, pruning, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.future.2025.108056

2504.06323

Country: Europe > United Kingdom (0.46)

Genre: Research Report (1.00)

Industry: Information Technology (0.68)

Technology: