Semantic Scheduling for LLM Inference

Wenyue Hua, Dujian Ding, Yile Gu, Yujie Ren, Kai Mei, Minghua Ma, William Yang Wang

arXiv.org Artificial Intelligence 

Conventional operating system scheduling algorithms are largely content-ignorant, making decisions based on factors such as latency or fairness without considering the actual intents or semantics of processes. Consequently, these algorithms often fail to prioritize tasks that require urgent attention or carry higher importance, such as in emergency management scenarios. However, recent advances in language models enable semantic analysis of processes, allowing for more intelligent and context-aware scheduling decisions. In this paper, we introduce the concept of semantic scheduling for serving requests to large language models (LLMs), where the semantics of each request guide its scheduling priority. We present a novel scheduling algorithm with optimal time complexity, designed to minimize the overall waiting time in LLM-based prompt scheduling.

Large language models (LLMs) are increasingly prevalent in a variety of domains, serving millions of users worldwide (Yu et al., 2024; Atkinson et al., 2020). Recent efforts to enhance LLM performance have focused on efficient serving architectures (Kwon et al., 2023; Dao et al., 2022; Hua et al., 2024), with the primary objectives of lowering latency and enhancing throughput. However, as LLM applications expand into areas such as medicine (Yu et al., 2024) and law (Atkinson et al., 2020), it becomes clear that the semantics (Mei et al., 2024) of each request (e.g., the urgency or importance of the request content) can be critical to scheduling decisions. Most LLM services currently employ a first-come-first-served (FCFS) scheduling strategy, largely because the running time of each user request is unknown.
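To make the contrast with FCFS concrete, the following is a minimal sketch of a semantic-priority scheduler. It is not the paper's algorithm: it simply assumes a hypothetical upstream LLM classifier has already assigned each prompt an `urgency` score in [0, 1], and dispatches the most urgent request first, breaking ties by arrival order (i.e., falling back to FCFS among equally urgent requests).

```python
import heapq
import itertools

class SemanticScheduler:
    """Toy semantic-priority scheduler: higher-urgency requests run first;
    ties are broken by arrival order, which reduces to FCFS when all
    urgency scores are equal. Urgency scores are assumed to come from a
    hypothetical LLM-based classifier, not computed here."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # monotone counter for FCFS tie-breaking

    def submit(self, prompt: str, urgency: float) -> None:
        # heapq is a min-heap, so negate urgency to pop the most urgent first
        heapq.heappush(self._heap, (-urgency, next(self._arrival), prompt))

    def next_request(self):
        # Returns the highest-priority pending prompt, or None if idle
        if not self._heap:
            return None
        _, _, prompt = heapq.heappop(self._heap)
        return prompt

sched = SemanticScheduler()
sched.submit("summarize this news article", urgency=0.2)
sched.submit("patient reports chest pain, need triage advice", urgency=0.9)
sched.submit("translate a recipe", urgency=0.2)

order = [sched.next_request() for _ in range(3)]
```

Under pure FCFS the news-article request would be served first; here the high-urgency medical request jumps the queue, while the two equally low-urgency requests retain their arrival order.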