Goto

Collaborating Authors

 Mendis, Charith


Transforming the Hybrid Cloud for Emerging AI Workloads

arXiv.org Artificial Intelligence

This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative, full-stack co-design approaches, emphasizing usability, manageability, affordability, adaptability, efficiency, and scalability. By integrating cutting-edge technologies such as generative and agentic AI, cross-layer automation and optimization, unified control plane, and composable and adaptive system architecture, the proposed framework addresses critical challenges in energy efficiency, performance, and cost-effectiveness. Incorporating quantum computing as it matures will enable quantum-accelerated simulations for materials science, climate modeling, and other high-impact domains. Collaborative efforts between academia and industry are central to this vision, driving advancements in foundation models for material design and climate solutions, scalable multimodal data processing, and enhanced physics-based AI emulators for applications like weather forecasting and carbon sequestration. Research priorities include advancing AI agentic systems, LLM as an Abstraction (LLMaaA), AI model optimization and unified abstractions across heterogeneous infrastructure, end-to-end edge-cloud transformation, efficient programming model, middleware and platform, secure infrastructure, application-adaptive cloud systems, and new quantum-classical collaborative workflows. These ideas and solutions encompass both theoretical and practical research questions, requiring coordinated input and support from the research community. This joint initiative aims to establish hybrid clouds as secure, efficient, and sustainable platforms, fostering breakthroughs in AI-driven applications and scientific discovery across academia, industry, and society.


TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs

arXiv.org Artificial Intelligence

Precise hardware performance models play a crucial role in code optimizations. They can assist compilers in making heuristic decisions or aid autotuners in identifying the optimal configuration for a given program. For example, the autotuner for XLA, a machine learning compiler, discovered 10-20% speedup on state-of-the-art models serving substantial production traffic at Google. Although there exist a few datasets for program performance prediction, they target small sub-programs such as basic blocks or kernels. This paper introduces TpuGraphs, a performance prediction dataset on full tensor programs, represented as computational graphs, running on Tensor Processing Units (TPUs). Each graph in the dataset represents the main computation of a machine learning workload, e.g., a training epoch or an inference step. Each data sample contains a computational graph, a compilation configuration, and the execution time of the graph when compiled with the configuration. The graphs in the dataset are collected from open-source machine learning programs, featuring popular model architectures, e.g., ResNet, EfficientNet, Mask R-CNN, and Transformer. TpuGraphs provides 25x more graphs than the largest graph property prediction dataset (with comparable graph sizes), and 770x larger graphs on average compared to existing performance prediction datasets on machine learning programs. This graph-level prediction task on large graphs introduces new challenges in learning, ranging from scalability, training efficiency, to model quality.


Learning Large Graph Property Prediction via Graph Segment Training

arXiv.org Artificial Intelligence

Learning to predict properties of large graphs is challenging because each prediction requires the knowledge of an entire graph, while the amount of memory available during training is bounded. Here we propose Graph Segment Training (GST), a general framework that utilizes a divide-and-conquer approach to allow learning large graph property prediction with a constant memory footprint. GST first divides a large graph into segments and then backpropagates through only a few segments sampled per training iteration. We refine the GST paradigm by introducing a historical embedding table to efficiently obtain embeddings for segments not sampled for backpropagation. To mitigate the staleness of historical embeddings, we design two novel techniques. First, we finetune the prediction head to fix the input distribution shift. Second, we introduce Stale Embedding Dropout to drop some stale embeddings during training to reduce bias. We evaluate our complete method GST-EFD (with all the techniques together) on two large graph property prediction benchmarks: MalNet and TpuGraphs. Our experiments show that GST-EFD is both memory-efficient and fast, while offering a slight boost on test accuracy over a typical full graph training regime.


FLuRKA: Fast fused Low-Rank & Kernel Attention

arXiv.org Artificial Intelligence

Many efficient approximate self-attention techniques have become prevalent since the inception of the transformer architecture. Two popular classes of these techniques are low-rank and kernel methods. Each of these methods has its own strengths. We observe these strengths synergistically complement each other and exploit these synergies to fuse low-rank and kernel methods, producing a new class of transformers: FLuRKA (Fast Low-Rank and Kernel Attention). FLuRKA provide sizable performance gains over these approximate techniques and are of high quality. We theoretically and empirically evaluate both the runtime performance and quality of FLuRKA. Our runtime analysis posits a variety of parameter configurations where FLuRKA exhibit speedups and our accuracy analysis bounds the error of FLuRKA with respect to full-attention. We instantiate three FLuRKA variants which experience empirical speedups of up to 3.3x and 1.7x over low-rank and kernel methods respectively. This translates to speedups of up to 30x over models with full-attention. With respect to model quality, FLuRKA can match the accuracy of low-rank and kernel methods on GLUE after pre-training on wiki-text 103. When pre-training on a fixed time budget, FLuRKA yield better perplexity scores than models with full-attention.


Input-sensitive dense-sparse primitive compositions for GNN acceleration

arXiv.org Artificial Intelligence

Graph neural networks (GNN) have become an important class of neural network models that have gained popularity in domains such as social and financial network analysis. Different phases of GNN computations can be modeled using both dense and sparse matrix operations. There have been many frameworks and optimization techniques proposed in the literature to accelerate GNNs. However, getting consistently high performance across many input graphs with different sparsity patterns and GNN embedding sizes has remained difficult. In this paper, we propose different algebraic reassociations of GNN computations that lead to novel dense and sparse matrix primitive selections and compositions. We show that the profitability of these compositions depends on the input graph, embedding size, and the target hardware. We developed SENSEi, a system that uses a data-driven adaptive strategy to select the best composition given the input graph and GNN embedding sizes. Our evaluations on a wide range of graphs and embedding sizes show that SENSEi achieves geomean speedups of $1.105\times$ (up to $2.959\times$) and $1.187\times$ (up to $1.99\times$) on graph convolutional networks and geomean speedups of $2.307\times$ (up to $35.866\times$) and $1.44\times$ (up to $5.69\times$) on graph attention networks on CPUs and GPUs respectively over the widely used Deep Graph Library. Further, we show that the compositions yield notable synergistic performance benefits on top of other established sparse optimizations such as sparse matrix tiling by evaluating against a well-tuned baseline.


COMET: X86 Cost Model Explanation Framework

arXiv.org Artificial Intelligence

ML-based program cost models have been shown to yield fairly accurate program cost predictions. They can replace heavily-engineered analytical program cost models in mainstream compilers, but their black-box nature discourages their adoption. In this work, we propose the first framework, COMET, for generating faithful, generalizable, and intuitive explanations for x86 cost models. COMET brings interpretability specifically to ML-based cost models, such as Ithemal. We generate and compare COMET's explanations for Ithemal against COMET's explanations for a hand-crafted, accurate analytical model, uiCA. Our empirical findings show an inverse correlation between the error in the cost prediction of a cost model and the prominence of semantically-richer features in COMET's explanations for the cost model for a given x86 basic block.


Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks

arXiv.org Machine Learning

Statically estimating the number of processor clock cycles it takes to execute a basic block of assembly instructions in steady state (throughput) is important for compiler backend optimizations such as register allocation, instruction selection and instruction scheduling. This is complicated specially in modern x86-64 Complex Instruction Set Computer (CISC) machines with sophisticated processor microarchitectures. Traditionally, compiler writers invest time experimenting and referring to processor manuals to analytically model modern processors with incomplete specifications. This is tedious, error prone and should be done for each processor generation. We present Ithemal, the first automatically learnt estimator to statically predict throughput of a set of basic block instructions using machine learning. Ithemal uses a novel Directed Acyclic Graph-Recurrent Neural Network (DAG-RNN) based data-driven approach for throughput estimation. We show that Ithemal is accurate than state-of-the-art hand written tools used in compiler backends and static machine code analyzers. In particular, our model has a worst case average error of 10.53% on actual throughput values when compared to best case average errors of 19.57% for the LLVM scheduler (llvm-mca) and 22.51% for IACA, Intel's machine code analyzer when compared on three different microarchitectures, while predicting throughput values at a faster rate than aforementioned tools. We also show that Ithemal is portable, learning throughput estimation for Intel Nehalem, Haswell and Skylake microarchitectures without requiring changes to its structure.