Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM

Trappen, Tim; Keßler, Robert; Pabel, Roland; Achter, Viktor; Wesner, Stefan

Nov-27-2025–arXiv.org Artificial Intelligence

Due to rising demand for Artificial Intelligence (AI) inference, especially in higher education, novel solutions that utilise existing infrastructure are emerging. High-Performance Computing (HPC) has become a prevalent basis for such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing, dynamic AI application workloads. In this paper, we propose a solution that serves large language models (LLMs) by integrating vLLM, Slurm and Kubernetes on the supercomputer RAMSES. An initial benchmark indicates that the proposed architecture scales efficiently for 100, 500 and 1000 concurrent requests, incurring an end-to-end latency overhead of only approximately 500 ms.
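
To make the benchmark claim concrete, the following is a minimal sketch of how an end-to-end latency overhead could be estimated at several concurrency levels. It is not the authors' benchmark harness: `fake_request`, `run_level` and `median_overhead_ms` are hypothetical names, the backend is simulated with `asyncio.sleep`, and the 10 ms and 15 ms service times are placeholder values. A real measurement would replace `fake_request` with an HTTP call to the serving endpoint, once directly and once through the integrated stack.

```python
import asyncio
import statistics
import time
from typing import Awaitable, Callable, List


async def fake_request(service_time_s: float) -> None:
    """Hypothetical stand-in for one inference call (assumption: no real network I/O)."""
    await asyncio.sleep(service_time_s)


async def timed(call: Callable[[], Awaitable[None]]) -> float:
    """Return the wall-clock latency of one awaited call, in milliseconds."""
    start = time.perf_counter()
    await call()
    return (time.perf_counter() - start) * 1000.0


async def run_level(concurrency: int, service_time_s: float) -> List[float]:
    """Issue `concurrency` requests at once and collect per-request latencies."""
    tasks = [timed(lambda: fake_request(service_time_s)) for _ in range(concurrency)]
    return list(await asyncio.gather(*tasks))


def median_overhead_ms(baseline_ms: List[float], routed_ms: List[float]) -> float:
    """Median latency of the routed path minus median latency of the baseline path."""
    return statistics.median(routed_ms) - statistics.median(baseline_ms)


if __name__ == "__main__":
    # Concurrency levels taken from the abstract; service times are placeholders.
    for level in (100, 500, 1000):
        baseline = asyncio.run(run_level(level, 0.010))
        routed = asyncio.run(run_level(level, 0.015))  # assumed extra hop of 5 ms
        print(level, round(median_overhead_ms(baseline, routed), 1))
```

Comparing medians rather than means keeps a few slow outliers from dominating the overhead estimate; tail percentiles could be reported alongside in the same way.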

gateway, large language model, machine learning

Country:
- North America > United States (0.70)
- Asia (0.68)
- Europe > Germany
  - North Rhine-Westphalia (0.14)

Genre:
- Research Report (0.70)

Industry:
- Information Technology (0.95)
- Education > Educational Setting (0.50)

Technology:
- Information Technology
  - Scientific Computing (1.00)
  - Cloud Computing (1.00)
  - Artificial Intelligence
    - Natural Language
      - Large Language Model (0.71)
      - Chatbot (0.47)
    - Machine Learning > Neural Networks
      - Deep Learning (0.46)
