Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator
Akshat Ramachandran, Souvik Kundu, Arnab Raha, Shamik Kundu, Deepak K. Mathaikutty, Tushar Krishna
Large language model (LLM) pruning with fixed N:M structured sparsity significantly limits the expressivity of the sparse model, yielding sub-optimal performance. On the contrary, support for more than one N:M pattern to provide sparse representational freedom yields a costly overhead in the hardware. To mitigate these challenges for LLMs, we first present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method. FLOW enables the identification of optimal layer-wise N and M values (from a given range) by simultaneously accounting for the presence and distribution of outliers, allowing a higher degree of representational freedom. To deploy the sparse models with such N:M flexibility, we then present a flexible, low-overhead, digital compute-in-memory architecture (FlexCiM). FlexCiM enables support for diverse sparsity patterns by partitioning a digital CiM (DCiM) macro into smaller sub-macros, which are adaptively aggregated and disaggregated through distribution and merging mechanisms for different values of N and M. Extensive experiments on both transformer-based and recurrence-based state space foundation models (SSMs) demonstrate that FLOW outperforms existing alternatives with an accuracy improvement of up to 36%, while FlexCiM delivers up to 1.75× lower inference latency and 1.5× lower energy consumption compared to existing sparse accelerators.

To reduce the colossal size of large language models (LLMs) and enable their efficient deployment on resource-constrained devices, post-training pruning has emerged as an effective model compression method [9], [33], [37]. It reduces the memory footprint of pre-trained LLMs by removing ineffectual model parameters, at the granularity of individual weights (unstructured) or blocks of weights (structured), and storing the sparse tensors in a compressed format (CSR/CSC) [14]. Notably, model pruning may also yield compute acceleration by skipping the ineffectual computations associated with zero-valued weights/activations. However, traditional weight pruning often requires fine-tuning, which becomes exceedingly compute-heavy for LLMs. Furthermore, it often requires the model to yield structured pruned weights, which can cause a large accuracy drop compared to models pruned via an unstructured approach.
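To make the N:M granularity concrete, the sketch below prunes a weight matrix to a given (N, M) pattern by keeping the N largest-magnitude weights in every contiguous group of M along the input dimension. It is a minimal illustration only: the function name `prune_n_m`, the per-layer `layer_patterns` dictionary, and the plain magnitude saliency are assumptions for illustration, not the authors' FLOW method, which additionally weighs outlier presence and distribution when selecting layer-wise N and M.

```python
# Minimal sketch of flexible N:M structured pruning (illustrative, not FLOW's code).
# For each group of M consecutive weights, keep the N entries with the largest
# magnitude and zero out the rest.
import torch


def prune_n_m(weight: torch.Tensor, n: int, m: int) -> torch.Tensor:
    """Apply N:M sparsity along the last dimension of a 2-D weight matrix."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dimension must be divisible by M"
    groups = weight.reshape(out_features, in_features // m, m)
    # Rank entries in each group by magnitude and build a keep-mask for the top-N.
    idx = groups.abs().argsort(dim=-1, descending=True)
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, idx[..., :n], torch.ones_like(idx[..., :n], dtype=torch.bool))
    return (groups * mask).reshape(out_features, in_features)


# Hypothetical per-layer (N, M) choices, e.g. denser patterns for outlier-heavy layers.
layer_patterns = {"attn.q_proj": (2, 4), "mlp.up_proj": (1, 4), "mlp.down_proj": (4, 8)}

if __name__ == "__main__":
    w = torch.randn(8, 16)
    n, m = layer_patterns["mlp.down_proj"]
    w_sparse = prune_n_m(w, n, m)
    kept = (w_sparse != 0).float().mean().item()
    print(f"kept fraction: {kept:.2f} (target {n}/{m})")
```

Allowing (N, M) to vary per layer in this way is what FlexCiM's sub-macro aggregation and disaggregation mechanisms are designed to support in hardware.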
arXiv.org Artificial Intelligence
Apr-22-2025
- Country:
- North America > United States (0.04)
- Genre:
- Research Report (0.82)
- Industry:
- Energy (0.34)
- Semiconductors & Electronics (0.46)