Equinox: Holistic Fair Scheduling in Serving Large Language Models
Wei, Zhixiang, Yen, James, Chen, Jingyi, Zhang, Ziyang, Huang, Zhibai, Chen, Chen, Yu, Xingzi, Gu, Yicheng, Wu, Chenggang, Wang, Yun, Xia, Mingyuan, Wu, Jie, Wang, Hao, Qi, Zhengwei
–arXiv.org Artificial Intelligence
We address the limitations of current LLM serving with a dual-counter framework that separates user and operator perspectives. The User Fairness Counter measures quality of service via weighted tokens and latency; the Resource Fairness Counter measures operational efficiency through throughput and GPU utilization. Because these metrics are only available after execution, which creates a scheduling paradox, we introduce a deterministic Mixture of Prediction Experts (MoPE) framework to predict user-perceived latency, output tokens, throughput, and GPU utilization. These predictions enable calculation of a unified Holistic Fairness score that balances both counters through tunable parameters for proactive, fairness-aware scheduling. We implement this in Equinox, an open-source system with additional optimizations such as adaptive batching and stall-free scheduling. Evaluations on production traces (ShareGPT, LMSYS) and synthetic workloads demonstrate that Equinox achieves up to 1.3× higher throughput, 60% lower time-to-first-token latency, and 13% higher fairness than VTC while maintaining 94% GPU utilization, and that it preserves fairness under bounded discrepancy across heterogeneous platforms.
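As an illustration only, the idea of balancing the two counters through a tunable parameter can be sketched as a convex combination. The abstract does not give the actual formula, so everything here is an assumption: the names `holistic_fairness` and `pick_next`, the single weight `alpha`, the normalization of both counters to [0, 1], and the rule that the client with the lowest score is served next.

```python
def holistic_fairness(ufc: float, rfc: float, alpha: float = 0.5) -> float:
    """Blend two counters into one score (assumed form, not the paper's formula).

    ufc:   predicted User Fairness Counter, assumed normalized to [0, 1]
    rfc:   predicted Resource Fairness Counter, assumed normalized to [0, 1]
    alpha: hypothetical tunable weight; 1.0 considers only the user side,
           0.0 considers only the operator side
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * ufc + (1.0 - alpha) * rfc


def pick_next(clients: dict[str, tuple[float, float]], alpha: float = 0.5) -> str:
    """Return the client with the lowest blended score.

    Serving the lowest score first is an assumption made for this sketch.
    `clients` maps a client id to its predicted (ufc, rfc) pair.
    """
    return min(clients, key=lambda c: holistic_fairness(*clients[c], alpha))


# Hypothetical predicted counters for two clients.
predicted = {"a": (0.9, 0.2), "b": (0.3, 0.6)}
choice_balanced = pick_next(predicted, alpha=0.5)
choice_operator_only = pick_next(predicted, alpha=0.0)
```

With these made-up values, changing `alpha` changes which client is selected, which is the kind of trade-off a tunable weighting is meant to expose.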
Aug-26-2025
- Country:
- North America > United States
- California (0.46)
- Texas (0.28)
- Genre:
- Research Report (0.84)
- Industry:
- Information Technology (0.46)