Goto

Collaborating Authors

 request group


Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning

arXiv.org Artificial Intelligence

Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which dominates end-to-end iteration time, suffers from substantial long-tail latency and poor resource utilization due to inherent workload imbalance. We present Seer, a novel online context learning system that addresses these challenges by exploiting previously overlooked similarities in output lengths and generation patterns among requests sharing the same prompt. Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding. Together, these mechanisms substantially reduce long-tail latency and improve resource efficiency during rollout. Evaluations on production-grade RL workloads demonstrate that Seer improves end-to-end rollout throughput by 74% to 97% and reduces long-tail latency by 75% to 93% compared to state-of-the-art synchronous RL systems, significantly accelerating RL training iterations.


One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving

arXiv.org Artificial Intelligence

$ $Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to in production settings. However, existing LLM serving systems focus on optimization objectives such as request serving throughput or request execution latency rather than the end-to-end latency SLOs. Achieving end-to-end SLOs for latency-sensitive requests is challenging due to head-of-line (HOL) blocking in the request queue, which results from bursty arrival rates and insufficient resources. To address the above challenge, we propose QLM, a multi-model queue management framework for LLM serving. QLM uses stochastic programming to orchestrate the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize SLO attainment. Specifically, QLM uses the following LSOs: model swapping, request eviction, GPU-CPU state swapping, load balancing, and warm model start. Evaluation on heterogeneous GPU devices and models with real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems.


Competitive Ratios for Online Multi-capacity Ridesharing

arXiv.org Artificial Intelligence

In multi-capacity ridesharing, multiple requests (e.g., customers, food items, parcels) with different origin and destination pairs travel in one resource. In recent years, online multi-capacity ridesharing services (i.e., where assignments are made online) like Uber-pool, foodpanda, and on-demand shuttles have become hugely popular in transportation, food delivery, logistics and other domains. This is because multi-capacity ridesharing services benefit all parties involved { the customers (due to lower costs), the drivers (due to higher revenues) and the matching platforms (due to higher revenues per vehicle/resource). Most importantly these services can also help reduce carbon emissions (due to fewer vehicles on roads). Online multi-capacity ridesharing is extremely challenging as the underlying matching graph is no longer bipartite (as in the unit-capacity case) but a tripartite graph with resources (e.g., taxis, cars), requests and request groups (combinations of requests that can travel together). The desired matching between resources and request groups is constrained by the edges between requests and request groups in this tripartite graph (i.e., a request can be part of at most one request group in the final assignment). While there have been myopic heuristic approaches employed for solving the online multi-capacity ridesharing problem, they do not provide any guarantees on the solution quality. To that end, this paper presents the first approach with bounds on the competitive ratio for online multi-capacity ridesharing (when resources rejoin the system at their initial location/depot after serving a group of requests).