ProbSelect: Stochastic Client Selection for GPU-Accelerated Compute Devices in the 3D Continuum

Stanisic, Andrija, Nastic, Stefan

arXiv.org Artificial Intelligence

Integration of edge, cloud, and space devices into a unified 3D continuum imposes significant challenges for client selection in federated learning systems. Traditional approaches rely on continuous monitoring and historical data collection, which become impractical in dynamic environments where satellites and mobile devices frequently change operational conditions. Furthermore, existing solutions primarily consider CPU-based computation, failing to capture the complex characteristics of GPU-accelerated training that is prevalent across the 3D continuum. This paper introduces ProbSelect, a novel approach utilizing analytical modeling and probabilistic forecasting for client selection on GPU-accelerated devices, without requiring historical data or continuous monitoring. Extensive evaluation across diverse GPU architectures and workloads demonstrates that ProbSelect improves SLO compliance by 13.77% on average while achieving 72.5% computational waste reduction compared to baseline approaches.


From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models

Cui, Xingqi, Liang, Chieh-Jan Mike, Xing, Jiarong, Qiu, Haoran

arXiv.org Artificial Intelligence

Serving large generative models such as LLMs and multimodal transformers requires balancing user-facing SLOs (e.g., time-to-first-token, time-between-tokens) with provider goals of efficiency and cost reduction. Existing solutions rely on static provisioning or model-level autoscaling, both of which treat the model as a monolith. This coarse-grained resource management leads to degraded performance or significant resource underutilization due to poor adaptability to dynamic inference traffic that is common online. The root cause of this inefficiency lies in the internal structure of generative models: they are executed as graphs of interconnected operators. Through detailed characterization and systematic analysis, we find that operators are heterogeneous in their compute and memory footprints and exhibit diverse sensitivity to workload and resource factors such as batch size, sequence length, and traffic rate. This heterogeneity suggests that the operator, rather than the entire model, is the right granularity for scaling decisions. We propose an operator-level autoscaling framework, which allocates resources at a finer (operator-level) granularity, optimizing the scaling, batching, and placement based on individual operator profiles. Evaluated on production-scale traces, our approach preserves SLOs with up to 40% fewer GPUs and 35% less energy, or under fixed resources achieves 1.6x higher throughput with 5% less energy. These results show that the operator, rather than the model, is fundamentally a more effective unit for scaling large generative workloads.


SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference

Zhang, Zongshun, Matta, Ibrahim

arXiv.org Artificial Intelligence

Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while minimizing costs of adaptive inference applications. However, prior work often overlooks real-world factors, such as Virtual Machine (VM) cold starts and requests under long-tail service time distributions. To tackle these limitations, we model each ML query (request) as traversing an acyclic sequence of stages, wherein each stage constitutes a contiguous block of sparse model parameters ending in an internal or final classifier where requests may exit. Since input-dependent exit rates vary, no single resource configuration suits all query distributions. IaaS-based VMs become underutilized when many requests exit early, yet rapidly scaling to handle request bursts reaching deep layers is impractical. SERFLOW addresses this challenge by leveraging FaaS-based serverless functions (containers) and using stage-specific resource provisioning that accounts for the fraction of requests exiting at each stage. By integrating this provisioning with adaptive load balancing across VMs and serverless functions based on request ingestion, SERFLOW reduces cloud costs by over 23% while efficiently adapting to dynamic workloads.
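
To make the idea of stage-specific provisioning concrete, the following minimal Python sketch computes how many requests reach each stage when a given fraction exits at every classifier, and how many replicas each stage would then need. It is an illustration only, not SERFLOW's algorithm: the function names, the per-stage exit-fraction input format, and the fixed per-replica capacity are all assumptions made here.

```python
import math
from typing import List


def stage_arrival_rates(ingest_rate: float, exit_fractions: List[float]) -> List[float]:
    """Return the request rate that reaches each stage.

    exit_fractions[i] is the fraction of requests entering stage i that
    exit at its classifier (hypothetical input format).
    """
    rates = []
    remaining = ingest_rate
    for frac in exit_fractions:
        if not 0.0 <= frac <= 1.0:
            raise ValueError("exit fraction must be in [0, 1]")
        rates.append(remaining)
        remaining *= 1.0 - frac  # only non-exiting requests move on
    return rates


def replicas_per_stage(rates: List[float], per_replica_capacity: float) -> List[int]:
    """Replicas needed per stage, assuming a fixed capacity per replica."""
    return [math.ceil(r / per_replica_capacity) for r in rates]


rates = stage_arrival_rates(100.0, [0.5, 0.5, 1.0])
print(rates)                            # [100.0, 50.0, 25.0]
print(replicas_per_stage(rates, 20.0))  # [5, 3, 2]
```

Under these assumptions the deepest stage needs fewer than half the replicas of the first one, which is the kind of imbalance that a single resource configuration for all stages cannot capture.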


PolyServe: Efficient Multi-SLO Serving at Scale

Zhu, Kan, Shi, Haiyang, Xu, Le, Shan, Jiaxin, Krishnamurthy, Arvind, Kasikci, Baris, Xie, Liguang

arXiv.org Artificial Intelligence

Advances in Large Language Models (LLMs) have led to a surge of LLM-powered applications. These applications have diverse token-generation latency requirements. As a result, simply classifying workloads as latency-sensitive (LS) or best-effort (BE) overlooks the nuances within the latency-sensitive category and results in suboptimal user experiences and scheduling opportunities. However, efficiently serving requests with multiple SLO requirements poses significant challenges. First, all requests within a batch generate new tokens simultaneously, which can misalign them with their distinct SLO requirements. Moreover, while existing systems focus on auto-scaling for handling various overall request rates, the diversity of SLOs necessitates fine-grained auto-scaling among these SLO tiers. Finally, unlike LS/BE scenarios, where BE requests can be aborted at any time to ensure the SLO attainment of LS requests, those with different latency-sensitive SLOs cannot tolerate prolonged delays, and tail latency must be controlled. To tackle these challenges, we propose PolyServe, a novel multi-SLO scheduling policy at scale that maintains high SLO attainment while maximizing throughput. PolyServe first groups requests into multiple bins based on their per-token latency requirement, then schedules each bin to a subset of the server fleet. PolyServe routes requests to the highest-load but still SLO-attainable server to create a load gradient that facilitates auto-scaling. To increase utilization, PolyServe permits looser-SLO requests to share tighter-SLO instances when their own servers are saturated. PolyServe uses profiling data to guide scheduling decisions and manage tail latency through request-wait-time-aware scheduling, dynamic chunking, and continuous chunked prefill prediction. PolyServe achieves 1.23x goodput gain compared to existing policies, achieving up to 92.5% of optimal goodput.
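
The binning and routing steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not PolyServe's implementation: the `Server` fields, the `max_load` limit (standing in for a profiled load at which a bin's SLO still holds), and the bin bounds are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    load: int      # requests currently being served
    max_load: int  # assumed profiled limit at which the bin's SLO still holds


def assign_bin(tpot_slo_ms: float, bin_bounds_ms: List[float]) -> Optional[int]:
    """Pick the loosest bin whose per-token latency bound is <= the request's SLO.

    bin_bounds_ms must be sorted ascending. Returns None when the request
    is tighter than every bin.
    """
    idx = bisect_right(bin_bounds_ms, tpot_slo_ms) - 1
    return idx if idx >= 0 else None


def route(servers: List[Server]) -> Optional[Server]:
    """Choose the highest-load server that can still attain the SLO."""
    candidates = [s for s in servers if s.load < s.max_load]
    return max(candidates, key=lambda s: s.load) if candidates else None


bounds = [20.0, 50.0, 100.0]
print(assign_bin(60.0, bounds))  # 1 (the 50 ms bin)

fleet = [Server("a", 3, 8), Server("b", 7, 8), Server("c", 8, 8)]
print(route(fleet).name)         # b
```

Packing requests onto the busiest server that still has headroom leaves lightly loaded servers idle, which is what makes them easy candidates for scale-down.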


ConsumerBench: Benchmarking Generative AI Applications on End-User Devices

Gu, Yile, Kadekodi, Rohan, Nguyen, Hoang, Kamahori, Keisuke, Liu, Yiyu, Kasikci, Baris

arXiv.org Artificial Intelligence

The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices. Unlike existing benchmarks that assume exclusive model access on dedicated GPUs, ConsumerBench simulates realistic multi-application scenarios executing concurrently on constrained hardware. Furthermore, ConsumerBench supports customizable workflows that simulate complex tasks requiring coordination among multiple applications. ConsumerBench captures both application-level metrics, including latency and Service Level Objective (SLO) attainment, and system-level metrics like CPU/GPU utilization and memory bandwidth. Through extensive experiments, ConsumerBench reveals inefficiencies in resource sharing, unfair scheduling under greedy allocation, and performance pitfalls of static model server configurations. The paper also provides practical insights for model developers and system designers, highlighting the benefits of custom kernels tailored to consumer-grade GPU architectures and the value of implementing SLO-aware scheduling strategies.


SCORPIO: Serving the Right Requests at the Right Time for Heterogeneous SLOs in LLM Inference

Tang, Yinghao, Lan, Tingfeng, Huang, Xiuqi, Lu, Hui, Chen, Wei

arXiv.org Artificial Intelligence

Existing Large Language Model (LLM) serving systems prioritize maximum throughput. They often neglect Service Level Objectives (SLOs) such as Time to First Token (TTFT) and Time Per Output Token (TPOT), which leads to suboptimal SLO attainment. This paper introduces SCORPIO, an SLO-oriented LLM serving system designed to maximize system goodput and SLO attainment for workloads with heterogeneous SLOs. Our core insight is to exploit SLO heterogeneity for adaptive scheduling across admission control, queue management, and batch selection. SCORPIO features a TTFT Guard, which employs least-deadline-first reordering and rejects unattainable requests, and a TPOT Guard, which utilizes a VBS-based admission control and a novel credit-based batching mechanism. Both guards are supported by a predictive module. Evaluations demonstrate that SCORPIO improves system goodput by up to 14.4X and SLO adherence by up to 46.5% compared to state-of-the-art baselines.
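
As a rough illustration of least-deadline-first reordering with rejection of unattainable requests, consider the sketch below. It is not SCORPIO's TTFT Guard: it assumes prefills run one after another, that a predicted prefill time is already available for each request, and the `Request` fields and `ttft_guard` name are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Request(NamedTuple):
    rid: str
    deadline: float      # absolute TTFT deadline, in seconds
    prefill_cost: float  # predicted prefill time, in seconds (assumed given)


def ttft_guard(queue: List[Request], now: float) -> Tuple[List[str], List[str]]:
    """Order requests by deadline and reject those predicted to miss it."""
    admitted, rejected = [], []
    clock = now
    for req in sorted(queue, key=lambda r: r.deadline):
        finish = clock + req.prefill_cost
        if finish > req.deadline:
            rejected.append(req.rid)  # unattainable: do not spend compute on it
        else:
            admitted.append(req.rid)
            clock = finish            # later requests wait behind this prefill
    return admitted, rejected


queue = [
    Request("a", deadline=1.0, prefill_cost=0.4),
    Request("b", deadline=0.5, prefill_cost=0.3),
    Request("c", deadline=0.6, prefill_cost=0.4),
]
print(ttft_guard(queue, now=0.0))  # (['b', 'a'], ['c'])
```

Rejecting "c" early frees the time it would have consumed, which lets "a" still meet its deadline.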


Ascendra: Dynamic Request Prioritization for Efficient LLM Serving

Ikram, Azam, Li, Xiang, Elnikety, Sameh, Bagchi, Saurabh

arXiv.org Artificial Intelligence

The rapid advancement of Large Language Models (LLMs) has driven the need for more efficient serving strategies. In this context, efficiency refers to the proportion of requests that meet their Service Level Objectives (SLOs), particularly for Time To First Token (TTFT) and Time Between Tokens (TBT). However, existing systems often prioritize one metric at the cost of the other. We present Ascendra, an LLM serving system designed to meet both TTFT and TBT SLOs simultaneously. The core insight behind Ascendra is that a request's urgency evolves as it approaches its deadline. To leverage this, Ascendra partitions GPU resources into two types of instances: low-priority and high-priority. Low-priority instances maximize throughput by processing requests out of arrival order, but at the risk of request starvation. To address this, Ascendra employs a performance model to predict requests at risk of missing their SLOs and proactively offloads them to high-priority instances. High-priority instances are optimized for low-latency execution and handle urgent requests nearing their deadlines. This partitioned architecture enables Ascendra to effectively balance high throughput and low latency. Extensive evaluation shows that Ascendra improves system throughput by up to 1.7x compared to vLLM and Sarathi-Serve while meeting both TTFT and TBT SLOs.


Bandwidth Allocation for Cloud-Augmented Autonomous Driving

Schafhalter, Peter, Krentsel, Alexander, Gonzalez, Joseph E., Ratnasamy, Sylvia, Shenker, Scott, Stoica, Ion

arXiv.org Artificial Intelligence

Autonomous vehicle (AV) control systems increasingly rely on ML models for tasks such as perception and planning. Current practice is to run these models on the car's local hardware due to real-time latency constraints and reliability concerns, which limits model size and thus accuracy. Prior work has observed that we could augment current systems by running larger models in the cloud, relying on faster cloud runtimes to offset the cellular network latency. However, prior work does not account for an important practical constraint: limited cellular bandwidth. We show that, for typical bandwidth levels, proposed techniques for cloud-augmented AV models take too long to transfer data, thus mostly falling back to the on-car models and resulting in no accuracy improvement. In this work, we show that realizing cloud-augmented AV models requires intelligent use of this scarce bandwidth, i.e., carefully allocating bandwidth across tasks and providing multiple data compression and model options. We formulate this as a resource allocation problem to maximize car utility, and present a system that increases average model accuracy by up to 15 percentage points on driving scenarios from the Waymo Open Dataset.


AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications

Shen, Haiying, Sen, Tanmoy

arXiv.org Artificial Intelligence

In this paper, we consider a mixed-prompt scenario for a large language model (LLM) inference serving system that supports diverse applications with both short and long prompts and heterogeneous SLOs for iteration time. To improve throughput when handling long prompts, previous research introduced a chunking method but did not address heterogeneous SLOs. To address this limitation, we propose AccelGen, a high-throughput LLM inference serving system with heterogeneous SLO guarantees for diverse applications. AccelGen introduces four core components: (1) SLO-guaranteed dynamic chunking, which dynamically adjusts chunk sizes to maximize GPU compute utilization while meeting iteration-level SLOs; (2) Iteration-level SLO-based task prioritization, which prioritizes tight-SLO requests and batches requests with similar SLOs; (3) Multi-resource-aware batching, which selects queued requests to maximize the utilization of both GPU compute resources and the key-value cache (KVC). Trace-driven real experiments demonstrate that AccelGen achieves 1.42-11.21X higher throughput, 1.43-13.71X higher goodput, 37-90% higher SLO attainment, and 1.61-12.22X lower response latency compared to the state-of-the-art approaches. It achieves performance near the Oracle, which optimally maximizes goodput.
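
One way to picture SLO-bounded dynamic chunking is to size each prefill chunk so that the whole iteration stays within its time budget. The sketch below assumes a linear iteration-time model (a fixed base cost plus a cost per token); that model, the parameter names, and the function itself are assumptions made for illustration and are not taken from AccelGen.

```python
def max_chunk_tokens(
    slo_us: int,
    decode_tokens: int,
    base_us: int,
    per_token_us: int,
    remaining_prompt: int,
) -> int:
    """Largest prefill chunk that keeps one iteration within its SLO.

    Assumes iteration time = base_us + per_token_us * (tokens in batch).
    Integer microseconds avoid floating-point rounding in the division.
    """
    budget_tokens = (slo_us - base_us) // per_token_us  # tokens the SLO allows
    chunk = budget_tokens - decode_tokens               # decode tokens go first
    return max(0, min(chunk, remaining_prompt))


# 50 ms SLO, 10 ms base cost, 0.1 ms per token, 32 decode tokens in the batch
print(max_chunk_tokens(50_000, 32, 10_000, 100, remaining_prompt=1000))  # 368
print(max_chunk_tokens(50_000, 32, 10_000, 100, remaining_prompt=200))   # 200
```

A tighter SLO shrinks the chunk and a looser one grows it, so the chunk size tracks the tightest iteration-level SLO in the batch instead of being fixed in advance.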


Mitigating KV Cache Competition to Enhance User Experience in LLM Inference

Shen, Haiying, Sen, Tanmoy

arXiv.org Artificial Intelligence

In Large Language Model (LLM) serving, the KV-cache (KVC) bottleneck causes high tail Time-to-First-Token (TTFT) and Time-Between-Tokens (TBT), impairing user experience, particularly in time-sensitive applications. However, satisfying both TTFT and TBT service-level objectives (SLOs) is challenging. To address this, we propose CacheOPT, a system for mitigating KV cache competition that builds on key insights from our measurements and incorporates several novel components. First, it estimates a request's output length, bounding the deviation with a high specified probability, adjusted based on the request arrival rate. Second, it allocates the estimated KVC demand to a request, and reuses other requests' allocated KVC to avoid preemptions while reducing waiting time. Third, it proactively allocates KVC before, rather than at, the time a request exhausts its allocation and reserves KVC globally to prevent preemptions. Fourth, it chooses a request that has a long TBT SLO, a long remaining job time, and a short preemption time to preempt. Fifth, it selects the shortest-latency strategy between swapping and recomputation for preemptions. Experiments show that CacheOPT achieves up to 3.29x and 2.83x lower tail TBT and tail TTFT, 47% and 53% higher TTFT and TBT SLO attainment, and supports up to 1.58x higher request arrival rate than the state-of-the-art methods.
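
The fifth step, choosing between swapping and recomputation by latency, can be sketched as a simple cost comparison. The cost model here is an assumption for illustration only (swap cost as a round trip of the KV cache over the host link, recomputation cost as re-running prefill over the context); the function and parameter names are hypothetical and not taken from CacheOPT.

```python
from typing import Tuple


def preemption_strategy(
    kv_bytes: float,
    link_bytes_per_s: float,
    context_tokens: int,
    prefill_tokens_per_s: float,
) -> Tuple[str, float]:
    """Return the cheaper preemption strategy and its estimated latency in seconds."""
    swap_s = 2 * kv_bytes / link_bytes_per_s              # swap out, then back in
    recompute_s = context_tokens / prefill_tokens_per_s   # rebuild KV via prefill
    if swap_s <= recompute_s:
        return "swap", swap_s
    return "recompute", recompute_s


# Short context: rebuilding the cache is cheaper than moving 2 GB twice.
print(preemption_strategy(2e9, 16e9, 1000, 10_000.0))  # ('recompute', 0.1)
# Long context: the transfer wins.
print(preemption_strategy(2e9, 16e9, 8000, 10_000.0))  # ('swap', 0.25)
```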