Compressing Large Language Models using Low Rank and Low Precision Decomposition

May-27-2025, 11:06:35 GMT–Neural Information Processing Systems

This work introduces CALDERA, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$, and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, which further improves zero-shot performance. Theoretical upper bounds on the approximation error of CALDERA are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of the target rank and the quantization bit budget.
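To make the shape of the decomposition concrete, here is a minimal NumPy sketch of a quantized backbone plus quantized low-rank factors. It is an illustration only, not the CALDERA algorithm itself: the uniform min-max quantizer, the single pass (quantize, then truncated SVD of the residual), the bit widths, and the function names `uniform_quantize`, `q_plus_lr`, and `relative_error` are all assumptions made for this example.

```python
import numpy as np


def uniform_quantize(x, bits):
    """Round x onto a uniform grid of 2**bits levels spanning [x.min(), x.max()].

    Assumption: a plain min-max uniform quantizer, chosen only for illustration.
    """
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo


def q_plus_lr(W, rank, bits_q=2, bits_lr=4):
    """One-pass sketch of W ~ Q + L @ R with all three factors quantized."""
    # Low-precision backbone: quantize W directly.
    Q = uniform_quantize(W, bits_q)
    # Low-rank correction: truncated SVD of what the backbone missed.
    U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
    root = np.sqrt(s[:rank])
    L = uniform_quantize(U[:, :rank] * root, bits_lr)          # shape (m, rank)
    R = uniform_quantize(root[:, None] * Vt[:rank], bits_lr)   # shape (rank, n)
    return Q, L, R


def relative_error(W, W_hat):
    """Relative Frobenius-norm approximation error."""
    return np.linalg.norm(W - W_hat) / np.linalg.norm(W)
```

Calling `q_plus_lr(W, rank=8)` on a weight matrix and comparing `relative_error(W, Q)` with `relative_error(W, Q + L @ R)` gives a rough feel for how the target rank and the bit budgets trade off against approximation error.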

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)