Jha, Saurabh
Characterizing GPU Resilience and Impact on AI/HPC Systems
Cui, Shengkun, Patke, Archit, Chen, Ziheng, Ranjan, Aditya, Nguyen, Hung, Cao, Phuong, Jha, Saurabh, Bode, Brett, Bauer, Gregory, Narayanaswami, Chandra, Sow, Daby, Di Martino, Catello, Kalbarczyk, Zbigniew T., Iyer, Ravishankar K.
In this study, we characterize GPU failures in Delta, a current-generation large-scale AI system with over 600 petaflops of peak compute throughput. The system comprises GPU and non-GPU nodes with modern AI accelerators, such as NVIDIA A40, A100, and H100 GPUs. The study uses two and a half years of data on GPU errors. We evaluate the resilience of GPU hardware components to determine the vulnerability of different GPU components to failure and their impact on GPU and node availability. We measure the key error propagation paths in GPU hardware, the GPU interconnect (NVLink), and GPU memory. Finally, we evaluate the impact of the observed GPU errors on user jobs. Our key findings are: (i) Contrary to common belief, GPU memory is over 30x more reliable than GPU hardware in terms of MTBE (mean time between errors). (ii) The newly introduced GSP (GPU System Processor) is the most vulnerable GPU hardware component. (iii) NVLink errors did not always lead to user job failure, which we attribute to the underlying error detection and retry mechanisms. (iv) We show multiple examples of hardware errors originating in key GPU hardware components and leading to application failure. (v) We project the impact of GPU node availability at larger scales with emulation and find that significant overprovisioning, between 5% and 20%, would be necessary to handle GPU failures. If GPU availability were improved to 99.9%, the required overprovisioning would be reduced by 4x.
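The MTBE comparison in finding (i) can be sketched with a few lines of Python. The observation window and error counts below are hypothetical round numbers chosen only to illustrate the >30x gap, not Delta measurements:

```python
def mtbe_hours(observation_hours: float, error_count: int) -> float:
    """Mean time between errors (MTBE) over an observation window."""
    if error_count == 0:
        return float("inf")  # no errors observed in the window
    return observation_hours / error_count

# Hypothetical counts over ~2.5 years (21,900 hours) of logs:
hw_mtbe = mtbe_hours(21_900, 300)  # 300 GPU hardware errors -> 73 h MTBE
mem_mtbe = mtbe_hours(21_900, 9)   # 9 GPU memory errors -> ~2,433 h MTBE

print(mem_mtbe / hw_mtbe)  # ~33x: memory is far more reliable by MTBE
```

A higher MTBE means errors are rarer, so a component with 30x the MTBE of another is 30x more reliable by this measure.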
Causal AI-based Root Cause Identification: Research to Practice at Scale
Jha, Saurabh, Rahane, Ameet, Shwartz, Laura, Palaci-Olgun, Marc, Bagehorn, Frank, Rios, Jesus, Stingaciu, Dan, Kattinakere, Ragu, Banerjee, Debasish
Modern applications are increasingly built as vast, intricate, distributed systems. These systems comprise various software modules, often developed by different teams using different programming languages and deployed across hundreds to thousands of machines, sometimes spanning multiple data centers. Given their scale and complexity, these applications are often designed to tolerate failures and performance issues through built-in failure recovery techniques (e.g., hardware or software redundancy) or external methods (e.g., health-check-based restarts). Despite every effort, computer systems experience frequent failures: performance degradations and violations of reliability and Key Performance Indicators (KPIs) are inevitable. These failures, depending on their nature, can lead to catastrophic incidents impacting critical systems and customers. Swift and accurate root cause identification is thus essential to avert significant incidents impacting both service quality and end users. In this complex landscape, observability platforms that provide deep insights into system behavior and help identify performance bottlenecks are not just helpful -- they are essential for maintaining reliability, ensuring optimal performance, and quickly resolving issues in production. The ability to reason about these systems in real time is critical to ensuring the scalability and stability of modern services. To aid in these investigations, observability platforms that collect various telemetry data to inform about application behavior and its underlying infrastructure are becoming increasingly popular.
ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks
Jha, Saurabh, Arora, Rohan, Watanabe, Yuji, Yanagawa, Takumi, Chen, Yinfang, Clark, Jackson, Bhavya, Bhavya, Verma, Mudit, Kumar, Harshit, Kitahara, Hirokuni, Zheutlin, Noah, Takano, Saki, Pathak, Divya, George, Felix, Wu, Xinbo, Turkkan, Bekir O., Vanloo, Gerard, Nidd, Michael, Dai, Ting, Chatterjee, Oishik, Gupta, Pranjal, Samanta, Suranjana, Aggarwal, Pooja, Lee, Rong, Murali, Pavankumar, Ahn, Jae-wook, Kar, Debanjana, Rahane, Ameet, Fonseca, Carlos, Paradkar, Amit, Deng, Yu, Moogi, Pratibha, Mohapatra, Prateeti, Abe, Naoki, Narayanaswami, Chandrasekhar, Xu, Tianyin, Varshney, Lav R., Mahindru, Ruchi, Sailer, Anca, Shwartz, Laura, Sow, Daby, Fuller, Nicholas C. M., Puri, Ruchir
Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand the effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents on real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation through push-button workflows and interpretable metrics. ITBench includes an initial set of 94 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 13.8% of SRE scenarios, 25.2% of CISO scenarios, and 0% of FinOps scenarios. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast.
Hierarchical Autoscaling for Large Language Model Serving with Chiron
Patke, Archit, Reddy, Dhemath, Jha, Saurabh, Narayanaswami, Chandra, Kalbarczyk, Zbigniew, Iyer, Ravishankar
Large language model (LLM) serving is becoming an increasingly important workload for cloud providers. Based on performance SLO requirements, LLM inference requests can be divided into (a) interactive requests with tight SLOs on the order of seconds, and (b) batch requests with relaxed SLOs on the order of minutes to hours. SLO attainment can degrade based on arrival rates, multiplexing, and configuration parameters, necessitating resource autoscaling of serving instances and their batch sizes. However, previous autoscalers for LLM serving do not consider request SLOs, leading to unnecessary scaling and resource under-utilization. To address these limitations, we introduce Chiron, an autoscaler built on the idea of hierarchical backpressure, estimated using queue size, utilization, and SLOs. Our experiments show that Chiron achieves up to 90% higher SLO attainment and improves GPU efficiency by up to 70% compared to existing solutions.
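The abstract does not give Chiron's formulas, but the shape of an SLO-aware backpressure signal can be sketched as follows. The thresholds, signal names, and scale-down condition here are illustrative assumptions, not Chiron's actual policy:

```python
def scale_decision(queue_len: int, utilization: float,
                   slo_headroom_s: float, service_rate: float) -> int:
    """Return a change in serving-instance count from a backpressure signal.

    Backpressure here: estimated time to drain the queue relative to the
    remaining SLO headroom. Positive -> scale up, negative -> scale down.
    """
    drain_time = queue_len / max(service_rate, 1e-9)  # seconds to clear queue
    if drain_time > slo_headroom_s:
        return 1   # queue cannot drain before SLOs expire: add an instance
    if utilization < 0.3 and queue_len == 0:
        return -1  # idle capacity: release an instance
    return 0       # hold

# 100 queued requests, 10 req/s per instance, 5 s of SLO headroom -> scale up.
print(scale_decision(100, 0.9, 5.0, 10.0))
```

An SLO-unaware autoscaler would react to queue length or utilization alone; keying the decision to SLO headroom is what avoids scaling when relaxed-SLO batch requests pile up harmlessly.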
One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
Patke, Archit, Reddy, Dhemath, Jha, Saurabh, Qiu, Haoran, Pinto, Christian, Cui, Shengkun, Narayanaswami, Chandra, Kalbarczyk, Zbigniew, Iyer, Ravishankar
Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to in production settings. However, existing LLM serving systems focus on optimization objectives such as request serving throughput or request execution latency rather than end-to-end latency SLOs. Achieving end-to-end SLOs for latency-sensitive requests is challenging due to head-of-line (HOL) blocking in the request queue, which results from bursty arrival rates and insufficient resources. To address this challenge, we propose QLM, a multi-model queue management framework for LLM serving. QLM uses stochastic programming to orchestrate the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize SLO attainment. Specifically, QLM uses the following LSOs: model swapping, request eviction, GPU-CPU state swapping, load balancing, and warm model start. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems.
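QLM's stochastic-programming formulation is beyond an abstract-sized sketch, but the HOL-blocking problem it attacks can be illustrated with a simple earliest-deadline-first reordering of the request queue. This is a toy stand-in, not QLM's algorithm, and the request names are invented:

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    deadline_s: float  # end-to-end SLO deadline, seconds from now

def reorder_edf(queue: list[Request]) -> list[Request]:
    """Order requests by SLO deadline so a long relaxed-SLO request at the
    head no longer blocks tight-SLO requests queued behind it."""
    return sorted(queue, key=lambda r: r.deadline_s)

# Under FCFS, batch-1 (1 h deadline) would block both chat requests.
q = [Request("batch-1", 3600.0), Request("chat-1", 2.0), Request("chat-2", 5.0)]
print([r.rid for r in reorder_edf(q)])  # ['chat-1', 'chat-2', 'batch-1']
```

QLM goes further than reordering: its LSOs (eviction, state swapping, load balancing, warm starts) change *where* and *whether* a request runs, not just its queue position.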
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
Qiu, Haoran, Mao, Weichao, Patke, Archit, Cui, Shengkun, Jha, Saurabh, Wang, Chen, Franke, Hubertus, Kalbarczyk, Zbigniew T., Başar, Tamer, Iyer, Ravishankar K.
Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times, which originate from the autoregressive nature of generative models. Existing LLM serving systems employ first-come-first-serve (FCFS) scheduling, which suffers from head-of-line blocking. To address the non-deterministic nature of LLMs and enable efficient interactive LLM serving, we present a speculative shortest-job-first (SSJF) scheduler that uses a lightweight proxy model to predict LLM output sequence lengths. Our open-source SSJF implementation does not require changes to memory management or batching strategies. Evaluations on real-world datasets and production workload traces show that SSJF reduces average job completion times by 30.5-39.6% and increases throughput by 2.2-3.6x compared to FCFS schedulers, across no-batching, dynamic-batching, and continuous-batching settings.
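The core of SSJF, ordering the queue by *predicted* rather than actual job length, fits in a few lines. The proxy below is a deliberately naive stand-in (prompt word count); the paper uses a learned proxy model instead:

```python
def ssjf_order(requests: list[str], predict_len) -> list[str]:
    """Speculative shortest-job-first: schedule by predicted output length.

    `predict_len` stands in for the proxy model: any callable mapping a
    prompt to an estimated number of output tokens.
    """
    return sorted(requests, key=predict_len)

def proxy(prompt: str) -> int:
    # Toy heuristic for illustration only: assume longer prompts
    # produce longer outputs.
    return len(prompt.split())

queue = ["summarize this very long document please", "hi", "translate hello"]
print(ssjf_order(queue, proxy))  # shortest predicted job runs first
```

Because the predictions are speculative, mispredictions can reorder jobs suboptimally, but shortest-job-first needs only a correct *ranking* of lengths, not exact values, to cut average completion time versus FCFS.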
Watch out for the risky actors: Assessing risk in dynamic environments for safe driving
Jha, Saurabh, Miao, Yan, Kalbarczyk, Zbigniew, Iyer, Ravishankar K.
Driving in a dynamic environment that contains other actors is inherently risky, as each actor influences the driving decision and may significantly limit the choices available for navigation and safety planning. The risk encountered by the Ego actor depends on the driving scenario and the uncertainty associated with predicting the future trajectories of the other actors in that scenario. However, not all objects pose similar risk: depending on an object's type, trajectory, position, and the uncertainty associated with these quantities, some objects pose a much higher risk than others. The higher the risk associated with an actor, the more attention must be directed toward that actor in terms of resources and safety planning. In this paper, we propose a novel risk metric to calculate the importance of each actor in the world and demonstrate its usefulness through a case study.
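The abstract does not define the metric itself, so the following is only a hypothetical per-actor risk score combining the three ingredients it names (position, motion, and prediction uncertainty); the functional form and weights are invented for illustration:

```python
def actor_risk(distance_m: float, closing_speed_mps: float,
               traj_std_m: float) -> float:
    """Toy per-actor risk score: closer, faster-approaching actors with
    more uncertain predicted trajectories score higher.
    All terms and weights are illustrative, not the paper's metric."""
    proximity = 1.0 / max(distance_m, 1.0)  # closer -> riskier
    approach = max(closing_speed_mps, 0.0)  # receding actors add no risk
    uncertainty = 1.0 + traj_std_m          # inflate by prediction spread
    return proximity * approach * uncertainty

# A nearby, approaching pedestrian with an uncertain trajectory outranks
# a distant car approaching at the same speed.
print(actor_risk(5, 3, 2.0) > actor_risk(50, 3, 0.5))  # True
```

A planner could then allocate prediction compute and safety margins in proportion to each actor's score, which is the resource-allocation use the abstract describes.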
BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics
Banerjee, Subho S., Jha, Saurabh, Kalbarczyk, Zbigniew T., Iyer, Ravishankar K.
Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to nondeterminism (e.g., undercounting due to event multiplexing or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in HPC measurements by using a domain-driven Bayesian model that captures microarchitectural relationships between HPCs to jointly infer their values as probability distributions. We provide the design and implementation of an accelerator that allows for low-latency and low-power inference of the BayesPerf model for x86 and ppc64 CPUs. BayesPerf reduces the average error in HPC measurements from 40.1% to 7.6% when events are being multiplexed. The value of BayesPerf in real-time decision-making is illustrated with a simple example of scheduling PCIe transfers.
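A minimal analogue of the idea, treating counter readings as distributions and fusing related evidence rather than trusting a single noisy sample, is precision-weighted Gaussian fusion. BayesPerf's actual model is a richer domain-driven Bayesian network over many counters; the numbers here are made up:

```python
def fuse_gaussian(mu1: float, var1: float,
                  mu2: float, var2: float) -> tuple[float, float]:
    """Precision-weighted fusion of two noisy estimates of the same event
    count: the simplest form of joint inference over related measurements."""
    w1, w2 = 1.0 / var1, 1.0 / var2
    mu = (w1 * mu1 + w2 * mu2) / (w1 + w2)
    var = 1.0 / (w1 + w2)  # fused estimate is tighter than either input
    return mu, var

# E.g., a directly measured cache-miss count (std 50) vs. one extrapolated
# from a multiplexed sampling window (std 200, hence less trusted):
mu, var = fuse_gaussian(1000.0, 50.0**2, 1200.0, 200.0**2)
print(round(mu))  # pulled only slightly toward the noisier estimate
```

The fused variance is what makes the output a probability distribution rather than a point value, which is the form BayesPerf exposes to downstream decision-makers.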
ML-based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection
Jha, Saurabh, Banerjee, Subho S., Tsai, Timothy, Hari, Siva K. S., Sullivan, Michael B., Kalbarczyk, Zbigniew T., Keckler, Stephen W., Iyer, Ravishankar K.
Autonomous vehicles (AVs) rely on artificial intelligence (AI) and machine learning (ML) to integrate mechanical, electronic, and computing technologies to make real-time driving decisions. AI enables AVs to navigate through complex environments while maintaining a safety envelope [1], [2] that is continuously measured and quantified by onboard sensors (e.g., camera, LiDAR, RADAR) [3]-[5]. Clearly, the safety and resilience of AVs are of significant concern, as exemplified by several headline-making AV crashes [6], [7], as well as prior work characterizing AV resilience during road tests [8]. Hence there is a compelling need for a comprehensive assessment of AV technology. Items (a), (b), and (c) are integrated into a Bayesian network (BN). BNs provide a favorable formalism in which to model the propagation of faults across AV system components with an interpretable model. The model, together with fault injection results, can be used to design and assess the safety of AVs. Further, BNs enable rapid probabilistic inference, which allows DriveFI to quickly find safety-critical faults. The Bayesian FI framework can be extended to other safety-critical systems (e.g., surgical robots). The framework requires specification of the safety constraints and the system software architecture to model the causal relationship between ...
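The fault-propagation idea, a BN linking injected faults to unsafe outcomes through intermediate components, can be illustrated with a toy two-edge chain. The structure and all probabilities below are hypothetical, chosen only to show the inference DriveFI-style frameworks perform:

```python
# Toy chain: fault -> sensor corruption -> unsafe driving decision.
# Conditional probability tables (hypothetical values):
P_corrupt_given_fault = {True: 0.8, False: 0.05}
P_unsafe_given_corrupt = {True: 0.6, False: 0.01}

def p_unsafe(fault: bool) -> float:
    """P(unsafe decision | fault) by marginalizing over sensor corruption."""
    total = 0.0
    for corrupt in (True, False):
        p_c = (P_corrupt_given_fault[fault] if corrupt
               else 1.0 - P_corrupt_given_fault[fault])
        total += p_c * P_unsafe_given_corrupt[corrupt]
    return total

print(round(p_unsafe(True), 3))  # 0.482: the fault sharply raises risk
print(p_unsafe(False))           # baseline risk without the fault
```

Comparing the two marginals is how such a model ranks candidate faults as safety-critical before spending expensive simulation or road time injecting them.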