TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling

Chen, Junyi, Du, Chuheng, Liu, Renyuan, Yao, Shuochao, Yan, Dingtian, Liao, Jiang, Liu, Shengzhong, Wu, Fan, Chen, Guihai

Oct-6-2025–arXiv.org Artificial Intelligence

Real-time LLM interactions demand streamed token generations, where text tokens are progressively generated and delivered to users while balancing two objectives: responsiveness (i.e., low time-to-first-token) and steady generation (i.e.,required time-between-tokens). Standard LLM serving systems suffer from the inflexibility caused by non-preemptive request scheduling and reactive memory management, leading to poor resource utilization and low request processing parallelism under request bursts. Therefore, we present TokenFlow, a novel LLM serving system with enhanced text streaming performance via preemptive request scheduling and proactive key-value (KV) cache management. TokenFlow dynamically prioritizes requests based on real-time token buffer occupancy and token consumption rate, while actively transferring KV cache between GPU and CPU memory in the background and overlapping I/O with computation to minimize request preemption overhead. Extensive experiments on Llama3-8B and Qwen2.5-32B across multiple GPUs (RTX 4090, A6000, H200) demonstrate that TokenFlow achieves up to 82.5% higher effective throughput (accounting for actual user consumption) while reducing P99 TTFT by up to 80.2%, without degrading overall token throughput.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

Oct-6-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.69)
- Europe > United Kingdom
  - Scotland (0.16)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.88)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found