LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits
Mirzaei, Amir Reza, Wen, Yuqiao, Cao, Yanshuai, Mou, Lili
arXiv.org Artificial Intelligence
Low-Rank Adaptation (LoRA) has become a popular technique for parameter-efficient fine-tuning of large language models (LLMs). In many real-world scenarios, multiple adapters are loaded simultaneously to enable LLM customization for personalized user experiences or to support a diverse range of tasks. Although each adapter is lightweight in isolation, their aggregate cost becomes substantial at scale. This makes it possible to quantize the important components to higher precision, while quantizing the rest to ultra-low bitwidth. We conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models on mathematical reasoning, coding, and summarization tasks.
Large Language Models (LLMs) have achieved remarkable performance across a wide range of natural language tasks (Ouyang et al., 2022; Wang et al., 2022; Zhao et al., 2023), but fine-tuning LLMs for new applications remains computationally and memory intensive. To address this challenge, low-rank adaptation (LoRA; Hu et al., 2022) has emerged as a widely adopted method for parameter-efficient fine-tuning. LoRA introduces small, task-specific low-rank matrices; during adaptation, only these matrices are trained while the base model stays frozen. An increasingly important use case of LoRA is LLM customization, as LLM providers (e.g., OpenAI and Google) allow users to personalize their own LLMs (OpenAI, 2025; Google Cloud, 2025).
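The claim that adapters are individually lightweight but costly in aggregate can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the layer dimensions, rank, number of adapted matrices, number of adapters, and the bit split between "important" and remaining rank components are all hypothetical values chosen for the example, not figures or the quantization scheme from the paper. It also ignores quantization metadata such as scales and zero points.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in one LoRA pair: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_out + d_in)


def storage_bytes(num_params: int, bits_per_param: float) -> float:
    """Raw storage for num_params values at the given bitwidth (no metadata)."""
    return num_params * bits_per_param / 8


def average_bits(rank: int, high_rank: int, high_bits: int, low_bits: int) -> float:
    """Average bitwidth when high_rank of the rank components are stored at
    high_bits and the remaining components at low_bits."""
    if not 0 <= high_rank <= rank:
        raise ValueError("high_rank must lie between 0 and rank")
    return (high_rank * high_bits + (rank - high_rank) * low_bits) / rank


# Hypothetical setup (assumed values, not taken from the paper).
d_out = d_in = 4096          # size of each adapted weight matrix
rank = 16                    # LoRA rank
num_adapted_matrices = 64    # adapted matrices per adapter
num_adapters = 1000          # adapters held at once

params_per_adapter = num_adapted_matrices * lora_param_count(d_out, d_in, rank)

# Total storage with every adapter kept in 16-bit floats.
fp16_total = num_adapters * storage_bytes(params_per_adapter, 16)

# Assumed split: 4 components at 3 bits, the other 12 at 1 bit.
mixed_bits = average_bits(rank, high_rank=4, high_bits=3, low_bits=1)
mixed_total = num_adapters * storage_bytes(params_per_adapter, mixed_bits)

print(f"parameters per adapter: {params_per_adapter:,}")
print(f"16-bit total: {fp16_total / 1e9:.2f} GB")
print(f"mixed precision ({mixed_bits} bits on average): {mixed_total / 1e9:.2f} GB")
```

Under these assumed numbers, a single adapter is small next to a 7B-parameter base model, yet a thousand of them in 16-bit form occupy a noticeable share of accelerator memory, which is the motivation for pushing the average bitwidth down.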
Nov-10-2025
- Country:
  - North America > Canada (0.28)
- Genre:
  - Research Report > New Finding (0.46)
- Industry:
  - Information Technology (0.34)
- Technology: