Iyer, Ravishankar
Hierarchical Autoscaling for Large Language Model Serving with Chiron
Patke, Archit, Reddy, Dhemath, Jha, Saurabh, Narayanaswami, Chandra, Kalbarczyk, Zbigniew, Iyer, Ravishankar
Large language model (LLM) serving is becoming an increasingly important workload for cloud providers. Based on performance SLO requirements, LLM inference requests can be divided into (a) interactive requests that have tight SLOs in the order of seconds, and (b) batch requests that have relaxed SLO in the order of minutes to hours. These SLOs can degrade based on the arrival rates, multiplexing, and configuration parameters, thus necessitating the use of resource autoscaling on serving instances and their batch sizes. However, previous autoscalers for LLM serving do not consider request SLOs leading to unnecessary scaling and resource under-utilization. To address these limitations, we introduce Chiron, an autoscaler that uses the idea of hierarchical backpressure estimated using queue size, utilization, and SLOs. Our experiments show that Chiron achieves up to 90% higher SLO attainment and improves GPU efficiency by up to 70% compared to existing solutions.
Transforming the Hybrid Cloud for Emerging AI Workloads
Chen, Deming, Youssef, Alaa, Pendse, Ruchi, Schleife, Andrรฉ, Clark, Bryan K., Hamann, Hendrik, He, Jingrui, Laino, Teodoro, Varshney, Lav, Wang, Yuxiong, Sil, Avirup, Jabbarvand, Reyhaneh, Xu, Tianyin, Kindratenko, Volodymyr, Costa, Carlos, Adve, Sarita, Mendis, Charith, Zhang, Minjia, Nรบรฑez-Corrales, Santiago, Ganti, Raghu, Srivatsa, Mudhakar, Kim, Nam Sung, Torrellas, Josep, Huang, Jian, Seelam, Seetharami, Nahrstedt, Klara, Abdelzaher, Tarek, Eilam, Tamar, Zhao, Huimin, Manica, Matteo, Iyer, Ravishankar, Hirzel, Martin, Adve, Vikram, Marinov, Darko, Franke, Hubertus, Tong, Hanghang, Ainsworth, Elizabeth, Zhao, Han, Vasisht, Deepak, Do, Minh, Oliveira, Fabio, Pacifici, Giovanni, Puri, Ruchir, Nagpurkar, Priya
This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative, full-stack co-design approaches, emphasizing usability, manageability, affordability, adaptability, efficiency, and scalability. By integrating cutting-edge technologies such as generative and agentic AI, cross-layer automation and optimization, unified control plane, and composable and adaptive system architecture, the proposed framework addresses critical challenges in energy efficiency, performance, and cost-effectiveness. Incorporating quantum computing as it matures will enable quantum-accelerated simulations for materials science, climate modeling, and other high-impact domains. Collaborative efforts between academia and industry are central to this vision, driving advancements in foundation models for material design and climate solutions, scalable multimodal data processing, and enhanced physics-based AI emulators for applications like weather forecasting and carbon sequestration. Research priorities include advancing AI agentic systems, LLM as an Abstraction (LLMaaA), AI model optimization and unified abstractions across heterogeneous infrastructure, end-to-end edge-cloud transformation, efficient programming model, middleware and platform, secure infrastructure, application-adaptive cloud systems, and new quantum-classical collaborative workflows. These ideas and solutions encompass both theoretical and practical research questions, requiring coordinated input and support from the research community. This joint initiative aims to establish hybrid clouds as secure, efficient, and sustainable platforms, fostering breakthroughs in AI-driven applications and scientific discovery across academia, industry, and society.
One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
Patke, Archit, Reddy, Dhemath, Jha, Saurabh, Qiu, Haoran, Pinto, Christian, Cui, Shengkun, Narayanaswami, Chandra, Kalbarczyk, Zbigniew, Iyer, Ravishankar
$ $Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to in production settings. However, existing LLM serving systems focus on optimization objectives such as request serving throughput or request execution latency rather than the end-to-end latency SLOs. Achieving end-to-end SLOs for latency-sensitive requests is challenging due to head-of-line (HOL) blocking in the request queue, which results from bursty arrival rates and insufficient resources. To address the above challenge, we propose QLM, a multi-model queue management framework for LLM serving. QLM uses stochastic programming to orchestrate the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize SLO attainment. Specifically, QLM uses the following LSOs: model swapping, request eviction, GPU-CPU state swapping, load balancing, and warm model start. Evaluation on heterogeneous GPU devices and models with real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems.
Integrating Artificial Intelligence with Real-time Intracranial EEG Monitoring to Automate Interictal Identification of Seizure Onset Zones in Focal Epilepsy
Varatharajah, Yogatheesan, Berry, Brent, Cimbalnik, Jan, Kremen, Vaclav, Van Gompel, Jamie, Stead, Matt, Brinkmann, Benjamin, Iyer, Ravishankar, Worrell, Gregory
An ability to map seizure-generating brain tissue, i.e., the seizure onset zone (SOZ), without recording actual seizures could reduce the duration of invasive EEG monitoring for patients with drug-resistant epilepsy. A widely-adopted practice in the literature is to compare the incidence (events/time) of putative pathological electrophysiological biomarkers associated with epileptic brain tissue with the SOZ determined from spontaneous seizures recorded with intracranial EEG, primarily using a single biomarker. Clinical translation of the previous efforts suffers from their inability to generalize across multiple patients because of (a) the inter-patient variability and (b) the temporal variability in the epileptogenic activity. Here, we report an artificial intelligence-based approach for combining multiple interictal electrophysiological biomarkers and their temporal characteristics as a way of accounting for the above barriers and show that it can reliably identify seizure onset zones in a study cohort of 82 patients who underwent evaluation for drug-resistant epilepsy. Our investigation provides evidence that utilizing the complementary information provided by multiple electrophysiological biomarkers and their temporal characteristics can significantly improve the localization potential compared to previously published single-biomarker incidence-based approaches, resulting in an average area under ROC curve (AUC) value of 0.73 in a cohort of 82 patients. Our results also suggest that recording durations between ninety minutes and two hours are sufficient to localize SOZs with accuracies that may prove clinically relevant. The successful validation of our approach on a large cohort of 82 patients warrants future investigation on the feasibility of utilizing intra-operative EEG monitoring and artificial intelligence to localize epileptogenic brain tissue.
A Contextual-bandit-based Approach for Informed Decision-making in Clinical Trials
Varatharajah, Yogatheesan, Berry, Brent, Koyejo, Sanmi, Iyer, Ravishankar
Clinical trials involving multiple treatments utilize randomization of the treatment assignments to enable the evaluation of treatment efficacies in an unbiased manner. Such evaluation is performed in post hoc studies that usually use supervised-learning methods that rely on large amounts of data collected in a randomized fashion. That approach often proves to be suboptimal in that some participants may suffer and even die as a result of having not received the most appropriate treatments during the trial. Reinforcement-learning methods improve the situation by making it possible to learn the treatment efficacies dynamically during the course of the trial, and to adapt treatment assignments accordingly. Recent efforts using \textit{multi-arm bandits}, a type of reinforcement-learning methods, have focused on maximizing clinical outcomes for a population that was assumed to be homogeneous. However, those approaches have failed to account for the variability among participants that is becoming increasingly evident as a result of recent clinical-trial-based studies. We present a contextual-bandit-based online treatment optimization algorithm that, in choosing treatments for new participants in the study, takes into account not only the maximization of the clinical outcomes but also the patient characteristics. We evaluated our algorithm using a real clinical trial dataset from the International Stroke Trial. The results of our retrospective analysis indicate that the proposed approach performs significantly better than either a random assignment of treatments (the current gold standard) or a multi-arm-bandit-based approach, providing substantial gains in the percentage of participants who are assigned the most suitable treatments. The contextual-bandit and multi-arm bandit approaches provide 72.63% and 64.34% gains, respectively, compared to a random assignment.
EEG-GRAPH: A Factor-Graph-Based Model for Capturing Spatial, Temporal, and Observational Relationships in Electroencephalograms
Varatharajah, Yogatheesan, Chong, Min Jin, Saboo, Krishnakant, Berry, Brent, Brinkmann, Benjamin, Worrell, Gregory, Iyer, Ravishankar
This paper presents a probabilistic-graphical model that can be used to infer characteristics of instantaneous brain activity by jointly analyzing spatial and temporal dependencies observed in electroencephalograms (EEG). Specifically, we describe a factor-graph-based model with customized factor-functions defined based on domain knowledge, to infer pathologic brain activity with the goal of identifying seizure-generating brain regions in epilepsy patients. We utilize an inference technique based on the graph-cut algorithm to exactly solve graph inference in polynomial time. We validate the model by using clinically collected intracranial EEG data from 29 epilepsy patients to show that the model correctly identifies seizure-generating brain regions. Our results indicate that our model outperforms two conventional approaches used for seizure-onset localization (5-7% better AUC: 0.72, 0.67, 0.65) and that the proposed inference technique provides 3-10% gain in AUC (0.72, 0.62, 0.69) compared to sampling-based alternatives.