Compressing Large Language Models using Low Rank and Low Precision Decomposition

Neural Information Processing Systems 

This work introduces CALDERA, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix W by approximating it via a low-rank, low-precision decomposition, W ≈ Q + LR. Here, L and R are low-rank factors, and the entries of Q, L, and R are quantized. The model is compressed by substituting each layer with its Q + LR decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, L and R are readily amenable to low-rank adaptation, which can further enhance zero-shot performance.
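
To make the Q + LR idea concrete, below is a minimal NumPy sketch that builds the low-rank factors from a truncated SVD and applies a simple uniform quantizer to Q, L, and R. This is an illustrative toy, not CALDERA's actual algorithm: the function names, bit widths, and rank are hypothetical choices, and the paper's method optimizes the decomposition rather than taking a one-shot SVD.

```python
import numpy as np

def uniform_quantize(x, num_bits):
    """Round x onto a uniform grid spanning its dynamic range (illustrative)."""
    levels = 2 ** num_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

def q_plus_lr_decompose(W, rank=64, bits_q=2, bits_factors=4):
    """Approximate W as Q + L @ R with quantized Q, L, and R.

    Sketch only: the low-rank part comes from a truncated SVD and the
    residual is quantized afterward, whereas CALDERA jointly optimizes
    the quantized backbone and the low-rank factors.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L = uniform_quantize(U[:, :rank] * S[:rank], bits_factors)  # (m, rank)
    R = uniform_quantize(Vt[:rank, :], bits_factors)            # (rank, n)
    Q = uniform_quantize(W - L @ R, bits_q)                     # quantized residual
    return Q, L, R

# Toy usage: measure how well Q + LR reconstructs a random weight matrix.
W = np.random.randn(512, 512).astype(np.float32)
Q, L, R = q_plus_lr_decompose(W)
err = np.linalg.norm(W - (Q + L @ R)) / np.linalg.norm(W)
print(f"relative Frobenius error: {err:.3f}")
```

Storing Q at very low precision while keeping the small factors L and R at slightly higher precision is what makes the decomposition attractive: the factors add only O((m + n) · rank) parameters but absorb the dominant directions of W that low-bit quantization would otherwise distort.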