Compressing Large Language Models using Low Rank and Low Precision Decomposition
Neural Information Processing Systems
This work introduces CALDERA, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$, and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, which further enhances zero-shot performance. Theoretical upper bounds on the approximation error of CALDERA are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget.
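To make the shape of the decomposition concrete, here is a minimal NumPy sketch of a quantized matrix plus quantized low-rank factors. It is not the authors' algorithm: the uniform round-to-nearest quantizer, the plain Frobenius-norm objective, the alternating update order, and the names `quantize`, `low_rank_factors`, and `decompose` are all assumptions made for illustration, since the abstract does not specify the quantizers or the exact update rule.

```python
import numpy as np

def quantize(x, bits):
    """Uniform round-to-nearest quantizer over the range of x (an assumed stand-in)."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo

def low_rank_factors(m, rank):
    """Rank-`rank` factors of m from a truncated SVD, splitting singular values evenly."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    root = np.sqrt(s[:rank])
    return u[:, :rank] * root, root[:, None] * vt[:rank]

def decompose(w, rank, q_bits=2, lr_bits=4, iters=10):
    """Alternate between Q and quantized L, R so that W is approximated by Q + L @ R.

    Returns the best (relative_error, Q, L, R) seen across iterations.
    """
    l = np.zeros((w.shape[0], rank))
    r = np.zeros((rank, w.shape[1]))
    best = None
    for _ in range(iters):
        # Fit the low-precision backbone to whatever the low-rank term misses.
        q = quantize(w - l @ r, q_bits)
        # Fit the low-rank term to the backbone's residual, then quantize its factors.
        l, r = low_rank_factors(w - q, rank)
        l, r = quantize(l, lr_bits), quantize(r, lr_bits)
        err = np.linalg.norm(w - (q + l @ r)) / np.linalg.norm(w)
        if best is None or err < best[0]:
            best = (err, q, l, r)
    return best

# Toy weight matrix with approximate low-rank structure (synthetic, not model weights).
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64)) / np.sqrt(8)
w += 0.1 * rng.standard_normal((64, 64))

err, q, l, r = decompose(w, rank=16)
print(f"relative Frobenius error of Q + LR: {err:.3f}")
```

Varying `rank`, `q_bits`, and `lr_bits` in this toy setting gives a rough feel for the rank versus bit-budget tradeoff the abstract describes, although the numbers say nothing about real LLM layers.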
May-27-2025, 11:06:35 GMT