ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning
Baohao Liao, Christian Herold, Seyyed Hadi Hashemi, Stefan Vasilev, Shahram Khadivi, Christof Monz
arXiv.org Artificial Intelligence
As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.
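The abstract describes clustering weight matrices into codebooks as the core compression step. As a minimal, hypothetical sketch of that general idea (plain k-means over small weight groups, not the paper's exact algorithm), assuming NumPy and illustrative parameter names:

```python
import numpy as np

def cluster_compress(W, n_codes=64, group=4, iters=10, seed=0):
    """Compress weight matrix W by clustering consecutive groups of
    `group` weights into a codebook of `n_codes` centroids via k-means.
    Returns (codebook, indices); storing these instead of W saves memory
    because each group is replaced by a small integer index."""
    rng = np.random.default_rng(seed)
    vecs = W.reshape(-1, group)  # each row is one weight group
    # initialize centroids from randomly chosen weight groups
    codebook = vecs[rng.choice(len(vecs), n_codes, replace=False)]
    for _ in range(iters):
        # assign each group to its nearest centroid (squared L2 distance)
        d = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(1)
        # move each centroid to the mean of its assigned groups
        for k in range(n_codes):
            members = idx == k
            if members.any():
                codebook[k] = vecs[members].mean(0)
    return codebook, idx

def decompress(codebook, idx, shape):
    """Rebuild an approximation of the original weight matrix."""
    return codebook[idx].reshape(shape)
```

In a full pipeline along the lines the abstract sketches, the codebook entries would then be finetuned block-by-block to recover accuracy; the sketch above only covers the clustering step.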
Mar-17-2025