Equinox: Holistic Fair Scheduling in Serving Large Language Models

Wei, Zhixiang, Yen, James, Chen, Jingyi, Zhang, Ziyang, Huang, Zhibai, Chen, Chen, Yu, Xingzi, Gu, Yicheng, Wu, Chenggang, Wang, Yun, Xia, Mingyuan, Wu, Jie, Wang, Hao, Qi, Zhengwei

Aug-26-2025–arXiv.org Artificial Intelligence

We address the limitations of current LLM serving with a dual-counter framework that separates user and operator perspectives. The User Fairness Counter measures quality of service via weighted tokens and latency; the Resource Fairness Counter measures operational efficiency through throughput and GPU utilization. Because these metrics are available only after execution, which creates a scheduling paradox, we introduce a deterministic Mixture of Prediction Experts (MoPE) framework to predict user-perceived latency, output tokens, throughput, and GPU utilization. These predictions enable calculation of a unified Holistic Fairness score that balances both counters through tunable parameters, allowing proactive fairness-aware scheduling. We implement this approach in Equinox, an open-source system that also includes optimizations such as adaptive batching and stall-free scheduling. Evaluations on production traces (ShareGPT, LMSYS) and synthetic workloads demonstrate that Equinox achieves up to $1.3\times$ higher throughput, 60\% lower time-to-first-token latency, and 13\% higher fairness than VTC while maintaining 94\% GPU utilization, proving fairness under bounded discrepancy across heterogeneous platforms.
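The abstract says that a single Holistic Fairness score balances the two counters through tunable parameters, but it does not give the formula. The following is a minimal sketch under stated assumptions, not the Equinox implementation: it assumes each counter has already been normalized to [0, 1] from predicted values, that the score is a weighted sum with hypothetical weights `alpha` and `beta`, and that the scheduler serves the client with the lowest score first. The names `holistic_score` and `pick_next` are illustrative only.

```python
from typing import Dict, Tuple


def holistic_score(user_counter: float, resource_counter: float,
                   alpha: float = 0.5, beta: float = 0.5) -> float:
    """Combine the two counters into one score.

    ASSUMPTION: a weighted sum; the abstract only states that the
    counters are balanced "through tunable parameters".
    """
    return alpha * user_counter + beta * resource_counter


def pick_next(counters: Dict[str, Tuple[float, float]],
              alpha: float = 0.5, beta: float = 0.5) -> str:
    """Return the client with the lowest combined score.

    ASSUMPTION: lowest-score-first selection, chosen here only to
    illustrate how the tunable weights change the scheduling decision.
    `counters` maps client id -> (user_counter, resource_counter).
    """
    return min(
        counters,
        key=lambda client: holistic_score(*counters[client], alpha, beta),
    )


if __name__ == "__main__":
    # Hypothetical, already-normalized counter values for two clients.
    predicted = {"a": (0.8, 0.2), "b": (0.3, 0.4)}
    print(pick_next(predicted))                        # equal weights
    print(pick_next(predicted, alpha=0.0, beta=1.0))   # operator view only
```

With equal weights the sketch favors the client that has received less user-side service; setting `alpha` to zero makes the decision depend only on the resource-side counter, which is one way to read the user/operator trade-off described above.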

large language model, machine learning, natural language

Country:
- North America > United States
  - California (0.46)
  - Texas (0.28)

Genre:
- Research Report (0.84)

Industry:
- Information Technology (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)