AITopics | Iyer, Ravishankar K.

Characterizing GPU Resilience and Impact on AI/HPC Systems

Cui, Shengkun, Patke, Archit, Chen, Ziheng, Ranjan, Aditya, Nguyen, Hung, Cao, Phuong, Jha, Saurabh, Bode, Brett, Bauer, Gregory, Narayanaswami, Chandra, Sow, Daby, Di Martino, Catello, Kalbarczyk, Zbigniew T., Iyer, Ravishankar K.

arXiv.org Artificial Intelligence, Mar-14-2025

In this study, we characterize GPU failures in Delta, the current large-scale AI system with over 600 petaflops of peak compute throughput. The system comprises GPU and non-GPU nodes with modern AI accelerators, such as NVIDIA A40, A100, and H100 GPUs. The study uses two and a half years of data on GPU errors. We evaluate the resilience of GPU hardware components to determine the vulnerability of different GPU components to failure and their impact on GPU and node availability. We measure the key propagation paths in GPU hardware, GPU interconnect (NVLink), and GPU memory. Finally, we evaluate the impact of the observed GPU errors on user jobs. Our key findings are: (i) Contrary to common beliefs, GPU memory is over 30x more reliable than GPU hardware in terms of MTBE (mean time between errors). (ii) The newly introduced GSP (GPU System Processor) is the most vulnerable GPU hardware component. (iii) NVLink errors did not always lead to user job failure, which we attribute to the underlying error detection and retry mechanisms. (iv) We show multiple examples of hardware errors originating from one of the key GPU hardware components, leading to application failure. (v) We project the impact of GPU node availability on larger scales with emulation and find that significant overprovisioning of 5-20% would be necessary to handle GPU failures. If GPU availability were improved to 99.9%, the overprovisioning would be reduced by 4x.
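
The two quantities the abstract relies on, MTBE and overprovisioning, can be illustrated with a small sketch. It is not the study's emulation: the helper names, the input numbers, and the simple expected-value sizing rule (provision enough nodes that the expected number of available nodes meets the requirement) are all assumptions made for illustration.

```python
import math


def mtbe_hours(observation_hours, error_count):
    """Mean time between errors: observed time divided by the number of errors."""
    return observation_hours / error_count


def spare_nodes(required_nodes, availability):
    """Extra nodes needed so the expected number of available nodes
    reaches `required_nodes` (assumed expected-value model, not the
    emulation used in the study)."""
    total = math.ceil(required_nodes / availability)
    return total - required_nodes


# Illustrative inputs only; none of these values come from the study.
example_mtbe = mtbe_hours(observation_hours=1000.0, error_count=4)
spares_at_95 = spare_nodes(required_nodes=1000, availability=0.95)
spares_at_999 = spare_nodes(required_nodes=1000, availability=0.999)
```

Under this simplified rule the number of spare nodes shrinks quickly as per-node availability approaches 1, which is the qualitative effect the abstract describes. The exact ratios depend on the model and will not match the reported figures.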

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv: 2503.11901

Country: North America > United States > Illinois > Champaign County > Urbana (0.14)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology (1.00)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Qiu, Haoran, Mao, Weichao, Patke, Archit, Cui, Shengkun, Jha, Saurabh, Wang, Chen, Franke, Hubertus, Kalbarczyk, Zbigniew T., Başar, Tamer, Iyer, Ravishankar K.

arXiv.org Artificial Intelligence, Apr-12-2024

Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the autoregressive nature of generative models. Existing LLM serving systems exploit first-come-first-serve (FCFS) scheduling, suffering from head-of-line blocking issues. To address the non-deterministic nature of LLMs and enable efficient interactive LLM serving, we present a speculative shortest-job-first (SSJF) scheduler that uses a light proxy model to predict LLM output sequence lengths. Our open-source SSJF implementation does not require changes to memory management or batching strategies. Evaluations on real-world datasets and production workload traces show that SSJF reduces average job completion times by 30.5-39.6% and increases throughput by 2.2-3.6x compared to FCFS schedulers, across no batching, dynamic batching, and continuous batching settings.
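
The scheduling idea in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' open-source SSJF implementation. It assumes all requests arrive at the same time, run one at a time without batching, and take one time unit per generated token. The `Request` class, the `predicted_len` values (standing in for a proxy model's output), and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    predicted_len: int  # output length guessed by a hypothetical proxy model
    actual_len: int     # true output length, used as the service time


def average_completion_time(order):
    """Average job completion time when requests run sequentially in `order`."""
    clock = 0
    total = 0
    for req in order:
        clock += req.actual_len  # assumption: one time unit per output token
        total += clock
    return total / len(order)


def fcfs(requests):
    """First-come-first-serve: keep the arrival order."""
    return list(requests)


def ssjf(requests):
    """Speculative shortest-job-first: order by predicted output length."""
    return sorted(requests, key=lambda r: r.predicted_len)


# Illustrative requests; the long request arrives first and blocks the others.
requests = [
    Request("a", predicted_len=900, actual_len=1000),
    Request("b", predicted_len=40, actual_len=50),
    Request("c", predicted_len=120, actual_len=100),
]
fcfs_jct = average_completion_time(fcfs(requests))
ssjf_jct = average_completion_time(ssjf(requests))
```

The predictions only need to rank requests correctly, not match the true lengths. In this toy case the proxy misjudges every length, but the ordering is still right, so the short requests no longer wait behind the long one.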

large language model, machine learning, natural language, (18 more...)

arXiv: 2404.08509

Country: North America > United States (0.68)

Genre: Research Report (0.82)

Industry: Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

REMEDI: REinforcement learning-driven adaptive MEtabolism modeling of primary sclerosing cholangitis DIsease progression

Hu, Chang, Saboo, Krishnakant V., Ali, Ahmad H., Juran, Brian D., Lazaridis, Konstantinos N., Iyer, Ravishankar K.

arXiv.org Artificial Intelligence, Oct-2-2023

Primary sclerosing cholangitis (PSC) is a rare disease wherein altered bile acid metabolism contributes to sustained liver injury. This paper introduces REMEDI, a framework that captures bile acid dynamics and the body's adaptive response during PSC progression and can assist in exploring treatments. REMEDI merges a differential equation (DE)-based mechanistic model that describes bile acid metabolism with reinforcement learning (RL) to continuously emulate the body's adaptations to PSC. An objective of adaptation is to maintain homeostasis by regulating enzymes involved in bile acid metabolism. These enzymes correspond to the parameters of the DEs. REMEDI leverages RL to approximate adaptations in PSC, treating homeostasis as a reward signal and the adjustment of the DE parameters as the corresponding actions. On real-world data, REMEDI generated bile acid dynamics and parameter adjustments consistent with published findings. Our results also support discussions in the literature that early administration of drugs that suppress bile acid synthesis may be effective in PSC treatment.

bile acid, machine learning, reinforcement learning, (16 more...)

arXiv: 2310.01426

Country: North America > United States (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Gastroenterology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)

RACR-MIL: Weakly Supervised Skin Cancer Grading using Rank-Aware Contextual Reasoning on Whole Slide Images

Choudhary, Anirudh, Hwang, Angelina, Kechter, Jacob, Saboo, Krishnakant, Bordeaux, Blake, Bhullar, Puneet, Comfere, Nneka, DiCaudo, David, Nelson, Steven, Johnson, Emma, Swanson, Leah, Murphree, Dennis, Mangold, Aaron, Iyer, Ravishankar K.

arXiv.org Artificial Intelligence, Aug-29-2023

Cutaneous squamous cell cancer (cSCC) is the second most common skin cancer in the US. It is diagnosed by manual multi-class tumor grading using a tissue whole slide image (WSI), which is subjective and suffers from inter-pathologist variability. We propose an automated weakly-supervised grading approach for cSCC WSIs that is trained using WSI-level grade and does not require fine-grained tumor annotations. The proposed model, RACR-MIL, transforms each WSI into a bag of tiled patches and leverages attention-based multiple-instance learning to assign a WSI-level grade. We propose three key innovations to address general as well as cSCC-specific challenges in tumor grading. First, we leverage spatial and semantic proximity to define a WSI graph that encodes both local and non-local dependencies between tumor regions and leverage graph attention convolution to derive contextual patch features. Second, we introduce a novel ordinal ranking constraint on the patch attention network to ensure that higher-grade tumor regions are assigned higher attention. Third, we use tumor depth as an auxiliary task to improve grade classification in a multitask learning framework. RACR-MIL achieves 2-9% improvement in grade classification over existing weakly-supervised approaches on a dataset of 718 cSCC tissue images and localizes the tumor better. The model achieves 5-20% higher accuracy in difficult-to-classify high-risk grade classes and is robust to class imbalance.
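
The attention-based multiple-instance pooling step mentioned above can be sketched as follows. This is a generic illustration and not RACR-MIL itself: it omits the graph attention, the ordinal ranking constraint, and the depth auxiliary task. The scoring vector `w` is a fixed, hypothetical value rather than a learned parameter, and the patch features are made up.

```python
import math


def attention_pool(patches, w):
    """Pool a bag of patch feature vectors into one bag-level vector.

    Each patch gets a scalar score w . tanh(patch); a softmax over the
    scores gives attention weights, and the bag embedding is the
    attention-weighted sum of the patch features.
    """
    scores = [sum(wj * math.tanh(xj) for wj, xj in zip(w, p)) for p in patches]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    norm = sum(exps)
    attn = [e / norm for e in exps]
    dim = len(patches[0])
    bag = [sum(a * p[j] for a, p in zip(attn, patches)) for j in range(dim)]
    return attn, bag


# Illustrative 2-dimensional features for three patches of one slide.
patches = [[0.1, 0.0], [2.0, 1.5], [0.3, 0.2]]
w = [1.0, 1.0]  # hypothetical scoring vector; learned in a real model
attn, bag = attention_pool(patches, w)
```

A classifier applied to `bag` would produce the slide-level grade, and the `attn` weights indicate which patches contributed most, which is what makes localization possible without patch-level labels.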

artificial intelligence, machine learning, weakly supervised skin cancer grading, (3 more...)

arXiv: 2308.15618

Country: North America > United States (0.24)

Genre: Research Report (0.40)

Industry:

Health & Medicine > Therapeutic Area > Dermatology (1.00)
Health & Medicine > Therapeutic Area > Oncology > Skin Cancer (0.60)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.53)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.40)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.40)

Watch out for the risky actors: Assessing risk in dynamic environments for safe driving

Jha, Saurabh, Miao, Yan, Kalbarczyk, Zbigniew, Iyer, Ravishankar K.

arXiv.org Artificial Intelligence, Oct-19-2021

Driving in a dynamic environment that contains other actors is an inherently risky task, as each actor influences the driving decision and may significantly limit the number of choices in terms of navigation and safety planning. The risk encountered by the Ego actor depends on the driving scenario and the uncertainty associated with predicting the future trajectories of the other actors in the driving scenario. However, not all objects pose a similar risk. Depending on an object's type, trajectory, and position, and the uncertainty associated with these quantities, some objects pose a much higher risk than others. The higher the risk associated with an actor, the more attention must be directed towards that actor in terms of resources and safety planning. In this paper, we propose a novel risk metric to calculate the importance of each actor in the world and demonstrate its usefulness through a case study.

artificial intelligence, ground transportation, machine learning, (20 more...)

arXiv: 2110.09998

Genre: Research Report (0.50)

Industry:

Transportation (0.48)
Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.47)

BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics

Banerjee, Subho S., Jha, Saurabh, Kalbarczyk, Zbigniew T., Iyer, Ravishankar K.

arXiv.org Artificial Intelligence, Feb-22-2021

Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to nondeterminism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in HPC measurements by using a domain-driven Bayesian model that captures microarchitectural relationships between HPCs to jointly infer their values as probability distributions. We provide the design and implementation of an accelerator that allows for low-latency and low-power inference of the BayesPerf model for x86 and ppc64 CPUs. BayesPerf reduces the average error in HPC measurements from 40.1% to 7.6% when events are being multiplexed. The value of BayesPerf in real-time decision-making is illustrated with a simple example of scheduling PCIe transfers.

bayesperf, deep learning, neural network, (21 more...)

doi: 10.1145/3445814.3446739

arXiv: 2102.10837

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.81)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.83)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)

ML-based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection

Jha, Saurabh, Banerjee, Subho S., Tsai, Timothy, Hari, Siva K. S., Sullivan, Michael B., Kalbarczyk, Zbigniew T., Keckler, Stephen W., Iyer, Ravishankar K.

arXiv.org Machine Learning, Jul-1-2019

… intelligence (AI) and machine learning (ML) to integrate mechanical, electronic, and computing technologies to make real-time driving decisions. AI enables AVs to navigate through complex environments while maintaining a safety envelope [1], [2] that is continuously measured and quantified by onboard sensors (e.g., camera, LiDAR, RADAR) [3]-[5]. Clearly, the safety and resilience of AVs are of significant concern, as exemplified by several headline-making AV crashes [6], [7], as well as prior work characterizing AV resilience during road tests [8]. Hence there is a compelling need for a comprehensive assessment of AV technology.

Items (a), (b), and (c) are integrated into a Bayesian network (BN). BNs provide a favorable formalism in which to model the propagation of faults across AV system components with an interpretable model. The model, together with fault injection results, can be used to design and assess the safety of AVs. Further, BNs enable rapid probabilistic inference, which allows DriveFI to quickly find safety-critical faults. The Bayesian FI framework can be extended to other safety-critical systems (e.g., surgical robots). The framework requires specification of the safety constraints and the system software architecture to model causal relationship between …

artificial intelligence, ground transportation, scenario, (19 more...)

arXiv: 1907.01051

Country: North America > United States > Illinois (0.14)

Genre: Research Report (1.00)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)
Information Technology > Robotics & Automation (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
