Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI
arXiv.org Artificial Intelligence
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, from conversational AI to code generation and content creation [1, 2, 3]. However, deploying these models in production environments presents significant engineering challenges. The computational demands of autoregressive text generation, combined with the massive parameter counts of modern LLMs, necessitate specialized serving infrastructure that can efficiently manage GPU resources while meeting application-specific performance requirements. Serving infrastructure for LLMs must balance several competing objectives: maximizing throughput to serve many concurrent users, minimizing latency for responsive user experiences, and efficiently utilizing expensive GPU resources. Different applications prioritize these objectives differently: a chatbot requires low latency for individual requests, while a batch document processing system prioritizes throughput. This variation in requirements has led to the development of specialized serving frameworks, each making different design trade-offs. Among the available open-source solutions, vLLM [4] and HuggingFace Text Generation Inference (TGI) [5] have emerged as leading frameworks, widely adopted in both research and production settings.
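The throughput/latency tension described above can be illustrated with a toy model. The sketch below is not taken from the paper or from vLLM/TGI internals; the per-step and per-sequence costs are invented assumptions, chosen only to show why larger batches raise aggregate throughput while also raising per-request latency in a static-batching server.

```python
import math

# Toy cost model for batched autoregressive decoding.
# Both constants are illustrative assumptions, not measured values.
STEP_COST_MS = 20.0   # assumed fixed cost of one batched decode step
PER_SEQ_MS = 1.0      # assumed marginal cost per sequence in the batch

def serve(num_requests: int, batch_size: int, tokens_per_request: int = 100):
    """Return (throughput in tokens/s, per-request latency in s)
    for a simple static-batching server model."""
    num_batches = math.ceil(num_requests / batch_size)
    # Each batch runs tokens_per_request decode steps; batching
    # amortizes the fixed step cost across the batch.
    step_ms = STEP_COST_MS + PER_SEQ_MS * batch_size
    batch_time_s = tokens_per_request * step_ms / 1000.0
    total_time_s = num_batches * batch_time_s
    throughput = num_requests * tokens_per_request / total_time_s
    latency = batch_time_s  # each request waits for its whole batch
    return throughput, latency

# Larger batches improve throughput but hurt per-request latency:
for bs in (1, 8, 32):
    tput, lat = serve(num_requests=64, batch_size=bs)
    print(f"batch={bs:2d}  throughput={tput:7.1f} tok/s  latency={lat:5.2f} s")
```

Under these assumed costs, batch size 1 maximizes responsiveness (the chatbot scenario) while batch size 32 maximizes tokens per second (the batch document-processing scenario); real systems such as vLLM use continuous batching precisely to soften this trade-off.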
Nov-25-2025