Prompt-Aware Scheduling for Low-Latency LLM Serving

Yiheng Tao, Yihe Zhang, Matthew T. Dearing, Xin Wang, Yuping Fan, Zhiling Lan

Oct-13-2025–arXiv.org Artificial Intelligence

Abstract: Efficient scheduling of large language model (LLM) inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies such as First-Come, First-Served (FCFS) often suffer from head-of-line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with a margin ranking loss. PARS focuses on impactful scheduling decisions and integrates seamlessly into vLLM, a state-of-the-art LLM serving system. It predicts the relative order of tasks by response length, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including on reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.

Large language models (LLMs) have emerged as core engines for artificial intelligence applications, demonstrating remarkable capabilities across a wide range of tasks, including question answering, code generation, and text classification.
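
The abstract describes approximating SJF by training a predictor with a pairwise margin ranking loss and then serving the queue in order of predicted response length. The sketch below illustrates that idea under toy assumptions that are ours, not the paper's: a linear scorer over hypothetical hand-made features stands in for PARS's learned prompt predictor, jobs run one at a time with no batching, and every request is queued at t = 0. The loss uses the standard form max(0, -y(s1 - s2) + margin), the same formulation as torch.nn.MarginRankingLoss; the abstract does not say which margin or pair-selection rule PARS uses. The names `Request`, `train_pairwise`, `schedule_by_rank`, and `mean_wait` are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple


@dataclass
class Request:
    features: Tuple[float, ...]  # hypothetical prompt-derived features
    service_time: float          # true response length; a training label, unknown at scheduling time


def margin_ranking_loss(s1: float, s2: float, y: int, margin: float = 1.0) -> float:
    """Standard margin ranking loss: max(0, -y * (s1 - s2) + margin).

    y = +1 means item 1 should score higher (longer response) than item 2.
    """
    return max(0.0, -y * (s1 - s2) + margin)


def score(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear scorer s(x) = w . x (a hypothetical stand-in for a learned predictor)."""
    return sum(wk * xk for wk, xk in zip(w, x))


def train_pairwise(reqs: List[Request], lr: float = 0.01, epochs: int = 100,
                   margin: float = 1.0) -> List[float]:
    """Subgradient descent on the margin ranking loss over all pairs with distinct lengths."""
    w = [0.0] * len(reqs[0].features)
    pairs = [(a, c) for a, c in combinations(reqs, 2) if a.service_time != c.service_time]
    for _ in range(epochs):
        for a, c in pairs:
            y = 1 if a.service_time > c.service_time else -1
            s1, s2 = score(w, a.features), score(w, c.features)
            if margin_ranking_loss(s1, s2, y, margin) > 0.0:
                # Active hinge: dL/dw_k = -y * (x_a[k] - x_c[k]); step against the gradient.
                for k in range(len(w)):
                    w[k] += lr * y * (a.features[k] - c.features[k])
    return w


def schedule_by_rank(reqs: List[Request], w: Sequence[float]) -> List[Request]:
    """Approximate SJF: serve requests in ascending predicted-length score."""
    return sorted(reqs, key=lambda r: score(w, r.features))


def mean_wait(order: List[Request]) -> float:
    """Mean waiting time on a single non-batching server with all jobs queued at t = 0."""
    elapsed, total = 0.0, 0.0
    for r in order:
        total += elapsed
        elapsed += r.service_time
    return total / len(order)


if __name__ == "__main__":
    # Toy queue: a long job arrives first, so FCFS causes head-of-line blocking.
    queue = [
        Request((9.0, 1.0), 90.0),
        Request((1.0, 0.0), 10.0),
        Request((2.0, 0.0), 20.0),
        Request((5.0, 1.0), 50.0),
    ]
    # For brevity we train and rank on the same toy queue; a real predictor
    # would be trained offline on past requests.
    w = train_pairwise(queue)
    print("FCFS mean wait:", mean_wait(queue))
    print("Ranked mean wait:", mean_wait(schedule_by_rank(queue, w)))
```

On this toy queue the script prints a mean wait of 77.5 under FCFS and 30.0 under the ranked order, because the three shorter requests no longer wait behind the 90-step job. Since the scheduler only sorts by score, it needs the relative order of response lengths rather than their exact values, which is why a pairwise ranking objective can be sufficient here.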

large language model, machine learning, natural language
