s-lora
PHLoRA: data-free Post-hoc Low-Rank Adapter extraction from full-rank checkpoint
Vasani, Bhoomit, FitzGerald, Jack, Fang, Anjie, Vaish, Sushmit
We introduce PHLoRA (Pronounced "flora"). (Post-hoc LoRA), a simple yet powerful method to extract low-rank adaptation adapters from full-rank fine-tuned models without requiring access to training data or gradients. By computing the low-rank decomposition of weight differences between a base model and its fine-tuned counterpart, our method reconstructs adapter modules that can be merged or dynamically routed at inference time via S-LoRA, or served in scalable, industry settings using platforms like NVIDIA NIM. This approach amortizes latency overhead across requests and yields substantial cost savings. Unlike prior work that trains each adapter explicitly, our approach decouples fine-tuning from adapter generation, allowing adapter extraction from existing full-rank models or third-party checkpoints. Experiments on text, image, and video benchmarks using the Amazon Nova model family demonstrate that extracted adapters preserve high energy from the full weight delta, can be pruned safely, and yield negligible degradation in downstream task performance when re-merged. Overall, PHLoRA provides a practical path for making all existing full-rank checkpoints adapter-ready, democratizing scalable inference for all models.
S-LoRA: Scalable Low-Rank Adaptation for Class Incremental Learning
Wu, Yichen, Piao, Hongming, Huang, Long-Kai, Wang, Renzhen, Li, Wanhua, Pfister, Hanspeter, Meng, Deyu, Ma, Kede, Wei, Ying
Continual Learning (CL) with foundation models has recently emerged as a promising approach to harnessing the power of pre-trained models for sequential tasks. Existing prompt-based methods generally use a prompt selection mechanism to select relevant prompts aligned with the test query for further processing. However, the success of these methods largely depends on the precision of the selection mechanism, which also raises scalable issues with additional computational overhead as tasks increase. To overcome these issues, we propose a Scalable Low-Rank Adaptation (S-LoRA) method for class incremental learning, which incrementally decouples the learning of the direction and magnitude of LoRA parameters. S-LoRA supports efficient inference by employing the last-stage trained model for direct testing without the selection process. Our theoretical and empirical analysis demonstrates that S-LoRA tends to follow a low-loss trajectory that converges to an overlapped low-loss region, resulting in an excellent stability-plasticity trade-off in CL. Furthermore, based on our findings, we develop variants of S-LoRA with further improved scalability. Continual Learning (CL) (Rolnick et al., 2019; Wang et al., 2024b; Zhou et al., 2024; Wang et al., 2022b) seeks to develop a learning system that can continually adapt to changing environments while retaining previously acquired knowledge.
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Sheng, Ying, Cao, Shiyi, Li, Dacheng, Hooper, Coleman, Lee, Nicholas, Yang, Shuo, Chou, Christopher, Zhu, Banghua, Zheng, Lianmin, Keutzer, Kurt, Gonzalez, Joseph E., Stoica, Ion
The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at https://github.com/S-LoRA/S-LoRA