Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity

Zhu, Wenbin, Shen, Zhaoyan, Shao, Zili, Dai, Hongjun, Chen, Feng

Dec-2-2025–arXiv.org Artificial Intelligence

Serverless Large Language Models (LLMs) have emerged as a cost-effective solution for deploying AI services by enabling a 'pay-as-you-go' pricing model through GPU resource sharing. However, cold-start latency, especially the model loading phase, has become a critical performance bottleneck, as it scales linearly with model size and severely limits the practical deployment of large-scale LLM services. This paper presents Tangram, a novel system that accelerates Serverless LLM loading through efficient GPU memory reuse. By leveraging the unused GPU memory to retain model parameters, Tangram significantly reduces model transfer time and cold-start latency. Its design includes three key components: unified GPU memory pool for tensor-level parameter sharing across models, on-demand KV cache allocation for dynamic memory management, and GPU-affinity-aware scheduling for maximizing resource utilization. These techniques collectively address the critical challenges of inefficient memory usage and the cold-start problem in Serverless LLM platforms. We have implemented a fully functional prototype, and experiments show that Tangram achieves up to 6.2 times faster loading and reduces Time-To-First-Token (TTFT) during cold-start by 23--55% over state-of-the-art methods.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

Dec-2-2025

arXiv.org PDF

Add feedback

Country:
- Europe (0.46)
- North America > United States (0.14)

Genre:
- Research Report > Promising Solution (0.34)

Technology:
- Information Technology
  - Hardware (1.00)
  - Artificial Intelligence
    - Natural Language > Large Language Model (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found