SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs

Mohammad Mozaffari, Maryam Mehri Dehnavi

arXiv.org Artificial Intelligence 

Large Language Models (LLMs) have revolutionized natural language understanding and generation tasks but suffer from high memory consumption and slow inference times due to their large parameter sizes. Traditional model compression techniques, such as quantization and pruning, mitigate these issues but often require retraining to maintain accuracy, which is computationally expensive. SLiM, our one-shot quantized sparse plus low-rank approximation method, reduces quantization error while leveraging sparse representations compatible with accelerated hardware architectures. Additionally, we propose a parameter-efficient fine-tuning recipe that significantly reduces overhead compared to conventional quantization-aware training. SLiM achieves up to a 5.4% improvement in model accuracy for sparsity patterns like 2:4, and the fine-tuning step further enhances accuracy by up to 5.8%, demonstrating state-of-the-art performance. This work provides a pathway for efficiently deploying large models in memory-constrained environments without compromising accuracy.

Large Language Models (LLMs) (Brown et al., 2020; Radford et al., 2019) are transformative for natural language understanding and generation (Suzgun et al., 2022; Zhou et al., 2023); however, their extensive parameter count leads to large memory footprints and long inference times, making them expensive to execute. Model compression methods, such as sparsity and quantization, have shown promising results in reducing the inference costs of LLMs. However, these methods often require costly retraining on large amounts of data to restore the original model accuracy (Sanh et al., 2020; Park et al., 2018), while facing numerical and optimization stability challenges when dealing with quantized weights in quantization-aware training (Gholami et al., 2022).
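To make the sparse-plus-low-rank idea in the title concrete, the NumPy sketch below (not the paper's actual SLiM procedure, which also incorporates quantization and its own error-compensation scheme) approximates a weight matrix as a hardware-friendly 2:4 semi-structured sparse component plus a truncated-SVD low-rank correction of the residual. All function and variable names here are illustrative assumptions, not identifiers from the paper.

```python
# Minimal sketch: approximate W as a 2:4 sparse matrix S plus a low-rank term L.
# Illustrative only; it omits quantization and any saliency-based weighting.
import numpy as np

def two_four_sparsify(W):
    """Keep the 2 largest-magnitude entries in every group of 4 along each row."""
    rows, cols = W.shape
    assert cols % 4 == 0, "2:4 sparsity needs the width to be a multiple of 4"
    groups = W.reshape(rows, cols // 4, 4)
    # Zero out the 2 smallest-magnitude entries in each group of 4.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

def low_rank_correction(residual, rank):
    """Best rank-r approximation of the residual via truncated SVD."""
    U, s, Vt = np.linalg.svd(residual, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))

S = two_four_sparsify(W)                 # 2:4 semi-structured sparse part
L = low_rank_correction(W - S, rank=8)   # low-rank term absorbing pruning error

print("sparse-only error  :", np.linalg.norm(W - S))
print("sparse + low-rank  :", np.linalg.norm(W - (S + L)))
```

The point of the sketch is the division of labor: the 2:4 mask keeps the bulk of the weights in a format that sparse tensor hardware can accelerate, while a small low-rank term recovers part of the accuracy lost to pruning, which is the general structure the abstract's claims refer to.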