Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM
Trappen, Tim, Keßler, Robert, Pabel, Roland, Achter, Viktor, Wesner, Stefan
arXiv.org Artificial Intelligence
Due to rising demand for Artificial Intelligence (AI) inference, especially in higher education, novel solutions that utilise existing infrastructure are emerging. High-Performance Computing (HPC) has become a prevalent platform for implementing such solutions. However, the classical HPC operating model does not adapt well to the requirements of synchronous, user-facing, dynamic AI application workloads. In this paper, we propose a solution that serves LLMs by integrating vLLM, Slurm and Kubernetes on the supercomputer RAMSES. An initial benchmark indicates that the proposed architecture scales efficiently to 100, 500 and 1000 concurrent requests, incurring an end-to-end latency overhead of only approximately 500 ms.
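To make the benchmark setup concrete, the following is a minimal sketch of how such a concurrency test against an OpenAI-compatible vLLM endpoint might look. The endpoint URL and model name are placeholders, not details from the paper, and this illustrates the general measurement idea rather than the authors' actual harness.

import asyncio
import time

import httpx

# Hypothetical values; point these at the deployed vLLM instance.
ENDPOINT = "http://localhost:8000/v1/completions"
MODEL = "placeholder-model"


async def one_request(client: httpx.AsyncClient) -> float:
    """Send a single completion request and return its end-to-end latency in seconds."""
    payload = {"model": MODEL, "prompt": "Hello", "max_tokens": 16}
    start = time.perf_counter()
    resp = await client.post(ENDPOINT, json=payload, timeout=120.0)
    resp.raise_for_status()
    return time.perf_counter() - start


async def run(concurrency: int) -> None:
    # Lift httpx's default cap of 100 connections so all requests are truly concurrent.
    limits = httpx.Limits(max_connections=None)
    async with httpx.AsyncClient(limits=limits) as client:
        latencies = await asyncio.gather(
            *(one_request(client) for _ in range(concurrency))
        )
    latencies = sorted(latencies)
    median = latencies[len(latencies) // 2]
    print(f"{concurrency} concurrent requests: median e2e latency {median * 1000:.0f} ms")


if __name__ == "__main__":
    for n in (100, 500, 1000):
        asyncio.run(run(n))

Comparing the median end-to-end latency at each concurrency level against a single-request baseline is one way to surface a per-request overhead of the kind the abstract reports (roughly 500 ms).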
Nov-27-2025
- Country:
  - Africa > Nigeria
    - Gulf of Guinea > Niger Delta (0.04)
  - Asia
    - China > Sichuan Province
      - Chengdu (0.04)
    - India (0.04)
    - Japan > Honshū
      - Kansai > Hyogo Prefecture > Kobe (0.04)
  - Europe
  - North America
    - Canada > Ontario
      - Toronto (0.04)
    - United States
      - California > San Francisco County
        - San Francisco (0.14)
      - Massachusetts > Middlesex County
        - Waltham (0.04)
      - Tennessee > Davidson County
        - Nashville (0.05)
- Genre:
  - Research Report (0.70)
- Industry:
  - Education > Educational Setting (0.50)
  - Information Technology (0.95)
- Technology: