Compressing Large Language Models using Low Rank and Low Precision Decomposition
–Neural Information Processing Systems
This work introduces CALDERA, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix W by approximating it via a low-rank, low-precision decomposition W ≈ Q + LR. Here, L and R are low-rank factors, and the entries of Q, L, and R are quantized. The model is compressed by substituting each layer with its Q + LR decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, L and R are readily amenable to low-rank adaptation, which further enhances zero-shot performance.
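As a rough illustration of the Q + LR idea (a minimal sketch, not CALDERA itself, whose quantizers and factor-computation procedure differ), the snippet below quantizes W to obtain Q, fits a truncated SVD to the residual W − Q to obtain L and R, and quantizes those factors as well. The round-to-nearest quantizer and the names `quantize` and `q_plus_lr` are illustrative assumptions.

```python
# Illustrative Q + LR sketch (assumption: simple uniform round-to-nearest
# quantizer and a single SVD pass; not the paper's actual algorithm).
import numpy as np

def quantize(x, bits=4):
    """Symmetric uniform quantization to roughly 2**bits levels."""
    levels = 2 ** bits - 1
    max_abs = np.abs(x).max()
    scale = max_abs / (levels / 2) if max_abs > 0 else 1.0
    return np.round(x / scale) * scale

def q_plus_lr(W, rank=32, q_bits=4, factor_bits=8):
    # Low-precision backbone Q: quantize the full weight matrix.
    Q = quantize(W, q_bits)
    # Low-rank correction: truncated SVD of the quantization residual.
    U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
    L = U[:, :rank] * np.sqrt(S[:rank])
    R = np.sqrt(S[:rank])[:, None] * Vt[:rank, :]
    # Entries of L and R are quantized too, typically at higher precision.
    return Q, quantize(L, factor_bits), quantize(R, factor_bits)

W = np.random.randn(512, 512)
Q, L, R = q_plus_lr(W)
err = np.linalg.norm(W - (Q + L @ R)) / np.linalg.norm(W)
print(f"relative Frobenius error: {err:.4f}")
```

In a sketch like this, the rank and the bit-widths of Q, L, and R trade compression ratio against reconstruction error.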
May-31-2025, 18:21:41 GMT
- Country:
- North America > United States (0.68)
- Genre:
- Research Report > Experimental Study (0.92)
- Industry:
- Government (0.68)
- Information Technology > Security & Privacy (0.45)
- Technology: