HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture

Taiqiang Wu, Chenchen Ding, Wenyong Zhou, Yuxin Cheng, Xincheng Feng, Shuqi Wang, Chufan Shi, Zhengwu Liu, Ngai Wong

arXiv.org Artificial Intelligence 

Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method for adapting large language models (LLMs) to downstream tasks. In this paper, we first propose to deploy LoRA-finetuned LLMs on a hybrid compute-in-memory (CIM) architecture (i.e., pretrained weights on RRAM and LoRA weights on SRAM). To address performance degradation from RRAM's inherent noise, we design a novel Hardware-aware Low-Rank Adaptation (HaLoRA) method, aiming to train a LoRA branch that is both robust and accurate by aligning the training objectives under ideal and noisy conditions. Experiments finetuning LLaMA 3.2 1B and 3B demonstrate HaLoRA's effectiveness across multiple reasoning tasks, achieving up to 22.7 improvement in average score while maintaining robustness at various noise levels.

Large language models (LLMs), such as GPT-4 [9], LLaMA [6], and Qwen [10], have demonstrated promising performance on various Natural Language Processing (NLP) tasks. However, this success, primarily driven by massive model parameters, brings two critical challenges in practical applications. First, adapting LLMs to downstream tasks via full model fine-tuning requires prohibitive computational resources.
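To make the deployment idea concrete, below is a minimal sketch (not the authors' implementation) of a LoRA linear layer whose frozen pretrained weight is perturbed with multiplicative Gaussian noise to mimic RRAM conductance variation, while the trainable low-rank branch stays noise-free as if held in SRAM. The noise model, the parameter `noise_std`, and the simple output-alignment loss at the end are illustrative assumptions, not the HaLoRA training objective itself.

```python
# Hedged sketch: LoRA with a noisy frozen base weight (RRAM) and a clean
# low-rank branch (SRAM). `noise_std` and the alignment loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyLoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0, noise_std=0.02):
        super().__init__()
        # Frozen pretrained weight, conceptually mapped onto RRAM.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.normal_(self.weight, std=0.02)
        # Trainable low-rank branch, conceptually mapped onto SRAM (noise-free).
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.normal_(self.lora_A, std=0.01)
        self.scaling = alpha / rank
        self.noise_std = noise_std

    def forward(self, x, simulate_rram_noise=True):
        w = self.weight
        if simulate_rram_noise:
            # Multiplicative Gaussian perturbation as a simple RRAM noise model.
            w = w * (1.0 + self.noise_std * torch.randn_like(w))
        base = F.linear(x, w)
        lora = F.linear(F.linear(x, self.lora_A), self.lora_B)
        return base + self.scaling * lora


# Training idea (hedged): encourage the layer's output under simulated RRAM
# noise to match its output under ideal conditions, so the learned LoRA branch
# remains both accurate and robust.
layer = NoisyLoRALinear(64, 64)
x = torch.randn(4, 64)
ideal_out = layer(x, simulate_rram_noise=False)
noisy_out = layer(x, simulate_rram_noise=True)
alignment_loss = F.mse_loss(noisy_out, ideal_out.detach())
```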