Punica: Multi-Tenant LoRA Serving
Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy
Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token.

LoRA is increasingly popular in specializing pre-trained large language models. LoRA retains the weights of the pre-trained model and introduces trainable rank decomposition matrices. We thus need to enable batching for different LoRA models. We thus only need to focus on the decode stage performance. We can apply straightforward techniques, e.g., on-demand loading of LoRA model weights.
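The batching idea described above can be sketched in plain PyTorch: the base-weight GEMM is shared by every request in the batch, while each request gathers and applies its own low-rank adapter. This is only a minimal sketch of the concept, not Punica's actual CUDA kernel; the names `batched_lora_linear`, `lora_A`, `lora_B`, and `adapter_idx` are illustrative assumptions, not the paper's API.

```python
import torch

def batched_lora_linear(x, W_base, lora_A, lora_B, adapter_idx):
    """Illustrative sketch (not Punica's kernel) of batching requests that
    use different LoRA adapters against one shared copy of the base weight.

    x:           (batch, d_in)         one token per request (decode step)
    W_base:      (d_in, d_out)         shared pre-trained weight, single GPU copy
    lora_A:      (n_adapters, d_in, r) low-rank down-projection per adapter
    lora_B:      (n_adapters, r, d_out) low-rank up-projection per adapter
    adapter_idx: (batch,)              which LoRA adapter each request uses
    """
    # Dense part: every request shares the same base weight, so this is one GEMM.
    y = x @ W_base
    # LoRA part: gather each request's adapter and apply its low-rank update.
    A = lora_A[adapter_idx]                              # (batch, d_in, r)
    B = lora_B[adapter_idx]                              # (batch, r, d_out)
    delta = torch.bmm(torch.bmm(x.unsqueeze(1), A), B)   # (batch, 1, d_out)
    return y + delta.squeeze(1)

# Tiny usage example with made-up sizes.
batch, d_in, d_out, r, n_adapters = 4, 16, 16, 2, 3
x = torch.randn(batch, d_in)
W = torch.randn(d_in, d_out)
A = torch.randn(n_adapters, d_in, r)
B = torch.randn(n_adapters, r, d_out)
idx = torch.tensor([0, 2, 1, 0])
print(batched_lora_linear(x, W, A, B, idx).shape)  # torch.Size([4, 16])
```

In a real serving stack the per-request gather and small matmuls would be fused into a custom kernel so they do not dominate decode latency; the sketch only shows why a single copy of the base weight can serve many different LoRA adapters in one batch.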
arXiv.org Artificial Intelligence
Oct-27-2023
- Country:
- North America > United States
- California (0.14)
- Hawaii (0.14)
- Genre:
- Research Report (0.50)
- Technology: