SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching

Tianyi Zhang, Junda Su, Oscar Wu, Zhaozhuo Xu, Anshumali Shrivastava

arXiv.org Artificial Intelligence 

Compressive adaptation approaches, such as QLoRA, are widely popular alternatives for reducing memory requirements during fine-tuning of large language models (LLMs) while producing models capable of handling various downstream tasks. The key idea is to employ a "two-tower" architecture: compressing the pretrained LLM parameters into compact representations and fine-tuning an additive full-precision adapter, which typically has few tunable parameters stored in low-rank format. However, strict algebraic assumptions, such as the low-rank assumption on adapters, and the complexity of composing two-tower architectures are known shortcomings that result in a poor accuracy-efficiency trade-off. In response to these limitations, we propose SpaLLM (Sketched Parameter Adaptation of LLMs), a novel compressive adaptation approach for LLMs. It is also the first method to illustrate parameter-sharing compression for LLM fine-tuning, which, unlike QLoRA, is free from strict low-rank algebraic assumptions on adapters. This approach simplifies the compressive adaptation workflow for LLMs, potentially improves multi-user serving efficiency, and delivers significantly better accuracy on both natural language understanding and generation tasks. Moreover, by avoiding the "two-tower" architecture, our framework requires only one compressed matrix multiplication per layer during inference, demonstrating superior inference efficiency compared to previous methods.

Recent advancements in Large Language Models (LLMs) have demonstrated exceptional performance in Natural Language Processing (NLP), enabling a broad spectrum of downstream applications. LLMs generalize impressively across many downstream tasks in a zero-shot manner. However, compared to training-free methods such as in-context learning (Dong et al., 2022; Rubin et al., 2021) and few-shot prompting (Brown, 2020; Song et al., 2023), fine-tuning these LLMs is often the preferred route to optimal performance on a specific downstream task (Ding et al., 2023). At the same time, full-precision fine-tuning of these LLMs is often impractical due to the massive requirement for high-performance computing devices such as GPUs. As a result, Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA) (Hu et al., 2022), have emerged as less resource-intensive approaches to fine-tuning while achieving reasonable performance. Clearly, there is a trade-off between accuracy and efficiency.
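The contrast between the "two-tower" adapter architecture and a single parameter-shared layer can be made concrete with a short PyTorch sketch. This is a minimal illustration under assumed names (TwoTowerLinear, SharedBucketLinear, bucket_ids) and a simple random bucket assignment as a stand-in for sketching; it is not the paper's implementation, only a sketch of why parameter sharing collapses inference to a single matrix multiplication per layer while imposing no low-rank structure on the update.

```python
# Illustrative sketch only: class names, bucket assignment, and sizes below are
# assumptions for exposition, not the SpaLLM implementation.
import torch
import torch.nn as nn


class TwoTowerLinear(nn.Module):
    """QLoRA-style layer: frozen compressed base weight plus a low-rank adapter."""

    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        # Stand-in for the compressed, frozen pretrained weight.
        self.register_buffer("w_base", torch.randn(d_out, d_in))
        # Full-precision low-rank adapter: the only trainable parameters.
        self.lora_a = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # Two matrix-multiplication paths per layer: base tower + adapter tower.
        return x @ self.w_base.T + (x @ self.lora_a.T) @ self.lora_b.T


class SharedBucketLinear(nn.Module):
    """Single-tower layer: weights come from a small shared table of trainable
    buckets, so inference needs one reconstructed matmul per layer and the
    trainable update is not constrained to be low-rank."""

    def __init__(self, d_in, d_out, n_buckets=4096):
        super().__init__()
        # Fixed (random) assignment of each weight position to a shared bucket.
        self.register_buffer(
            "bucket_ids", torch.randint(0, n_buckets, (d_out, d_in))
        )
        # The shared bucket values are the only trainable parameters.
        self.buckets = nn.Parameter(torch.randn(n_buckets) * 0.02)

    def forward(self, x):
        # Reconstruct the weight from the shared table, then a single matmul.
        w = self.buckets[self.bucket_ids]
        return x @ w.T
```

In this toy version, the number of trainable parameters in SharedBucketLinear is set by the bucket count rather than by a rank, which is the sense in which parameter-sharing compression avoids the low-rank algebraic assumption of adapter-based methods.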