Srivatsa, Mudhakar
HadaCore: Tensor Core Accelerated Hadamard Transform Kernel
Agarwal, Krish, Astra, Rishi, Hoque, Adnan, Srivatsa, Mudhakar, Ganti, Raghu, Wright, Less, Chen, Sijia
We present HadaCore, a modified Fast Walsh-Hadamard Transform (FWHT) algorithm optimized for the Tensor Cores present in modern GPU hardware. HadaCore follows the recursive structure of the original FWHT algorithm, achieving the same asymptotic runtime complexity but leveraging a hardware-aware work decomposition that benefits from Tensor Core acceleration. This reduces bottlenecks from compute and data exchange. On Nvidia A100 and H100 GPUs, HadaCore achieves speedups of 1.1-1.4x and 1.0-1.3x, with a peak gain of 3.5x and 3.6x respectively, when compared to the existing state-of-the-art implementation of the original algorithm. We also show that when using FP16 or BF16, our implementation is numerically accurate, enabling comparable accuracy on MMLU benchmarks when used in an end-to-end Llama3 inference run with quantized (FP8) attention.
Transforming the Hybrid Cloud for Emerging AI Workloads
Chen, Deming, Youssef, Alaa, Pendse, Ruchi, Schleife, Andrรฉ, Clark, Bryan K., Hamann, Hendrik, He, Jingrui, Laino, Teodoro, Varshney, Lav, Wang, Yuxiong, Sil, Avirup, Jabbarvand, Reyhaneh, Xu, Tianyin, Kindratenko, Volodymyr, Costa, Carlos, Adve, Sarita, Mendis, Charith, Zhang, Minjia, Nรบรฑez-Corrales, Santiago, Ganti, Raghu, Srivatsa, Mudhakar, Kim, Nam Sung, Torrellas, Josep, Huang, Jian, Seelam, Seetharami, Nahrstedt, Klara, Abdelzaher, Tarek, Eilam, Tamar, Zhao, Huimin, Manica, Matteo, Iyer, Ravishankar, Hirzel, Martin, Adve, Vikram, Marinov, Darko, Franke, Hubertus, Tong, Hanghang, Ainsworth, Elizabeth, Zhao, Han, Vasisht, Deepak, Do, Minh, Oliveira, Fabio, Pacifici, Giovanni, Puri, Ruchir, Nagpurkar, Priya
This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative, full-stack co-design approaches, emphasizing usability, manageability, affordability, adaptability, efficiency, and scalability. By integrating cutting-edge technologies such as generative and agentic AI, cross-layer automation and optimization, unified control plane, and composable and adaptive system architecture, the proposed framework addresses critical challenges in energy efficiency, performance, and cost-effectiveness. Incorporating quantum computing as it matures will enable quantum-accelerated simulations for materials science, climate modeling, and other high-impact domains. Collaborative efforts between academia and industry are central to this vision, driving advancements in foundation models for material design and climate solutions, scalable multimodal data processing, and enhanced physics-based AI emulators for applications like weather forecasting and carbon sequestration. Research priorities include advancing AI agentic systems, LLM as an Abstraction (LLMaaA), AI model optimization and unified abstractions across heterogeneous infrastructure, end-to-end edge-cloud transformation, efficient programming model, middleware and platform, secure infrastructure, application-adaptive cloud systems, and new quantum-classical collaborative workflows. These ideas and solutions encompass both theoretical and practical research questions, requiring coordinated input and support from the research community. This joint initiative aims to establish hybrid clouds as secure, efficient, and sustainable platforms, fostering breakthroughs in AI-driven applications and scientific discovery across academia, industry, and society.
CaloChallenge 2022: A Community Challenge for Fast Calorimeter Simulation
Krause, Claudius, Giannelli, Michele Faucci, Kasieczka, Gregor, Nachman, Benjamin, Salamani, Dalila, Shih, David, Zaborowska, Anna, Amram, Oz, Borras, Kerstin, Buckley, Matthew R., Buhmann, Erik, Buss, Thorsten, Cardoso, Renato Paulo Da Costa, Caterini, Anthony L., Chernyavskaya, Nadezda, Corchia, Federico A. G., Cresswell, Jesse C., Diefenbacher, Sascha, Dreyer, Etienne, Ekambaram, Vijay, Eren, Engin, Ernst, Florian, Favaro, Luigi, Franchini, Matteo, Gaede, Frank, Gross, Eilam, Hsu, Shih-Chieh, Jaruskova, Kristina, Kรคch, Benno, Kalagnanam, Jayant, Kansal, Raghav, Kim, Taewoo, Kobylianskii, Dmitrii, Korol, Anatolii, Korcari, William, Krรผcker, Dirk, Krรผger, Katja, Letizia, Marco, Li, Shu, Liu, Qibin, Liu, Xiulong, Loaiza-Ganem, Gabriel, Madula, Thandikire, McKeown, Peter, Melzer-Pellmann, Isabell-A., Mikuni, Vinicius, Nguyen, Nam, Ore, Ayodele, Schweitzer, Sofia Palacios, Pang, Ian, Pedro, Kevin, Plehn, Tilman, Pokorski, Witold, Qu, Huilin, Raikwar, Piyush, Raine, John A., Reyes-Gonzalez, Humberto, Rinaldi, Lorenzo, Ross, Brendan Leigh, Scham, Moritz A. W., Schnake, Simon, Shimmin, Chase, Shlizerman, Eli, Soybelman, Nathalie, Srivatsa, Mudhakar, Tsolaki, Kalliopi, Vallecorsa, Sofia, Yeo, Kyongmin, Zhang, Rui
We present the results of the "Fast Calorimeter Simulation Challenge 2022" -- the CaloChallenge. We study state-of-the-art generative models on four calorimeter shower datasets of increasing dimensionality, ranging from a few hundred voxels to a few tens of thousand voxels. The 31 individual submissions span a wide range of current popular generative architectures, including Variational AutoEncoders (VAEs), Generative Adversarial Networks (GANs), Normalizing Flows, Diffusion models, and models based on Conditional Flow Matching. We compare all submissions in terms of quality of generated calorimeter showers, as well as shower generation time and model size. To assess the quality we use a broad range of different metrics including differences in 1-dimensional histograms of observables, KPD/FPD scores, AUCs of binary classifiers, and the log-posterior of a multiclass classifier. The results of the CaloChallenge provide the most complete and comprehensive survey of cutting-edge approaches to calorimeter fast simulation to date. In addition, our work provides a uniquely detailed perspective on the important problem of how to evaluate generative models. As such, the results presented here should be applicable for other domains that use generative AI and require fast and faithful generation of samples in a large phase space.
Accelerating Production LLMs with Combined Token/Embedding Speculators
Wertheimer, Davis, Rosenkranz, Joshua, Parnell, Thomas, Suneja, Sahil, Ranganathan, Pavithra, Ganti, Raghu, Srivatsa, Mudhakar
One approach to squaring this circle is speculative decoding, where a smaller draft model or speculator is trained This technical report describes the design and training to predict multiple tokens given a sequence of input. These of novel speculative decoding draft models, for accelerating speculative tokens are produced with low cost, and lower the inference speeds of large language models in a accuracy than the base LLM. However, we can leverage production environment. By conditioning draft predictions GPU parallelism during the LLM forward pass to evaluate on both context vectors and sampled tokens, we can train the output for each of these new tokens with minimal additional our speculators to efficiently predict high-quality n-grams, overhead. Then, by comparing the outputs to the which the base model then accepts or rejects. This allows us speculated inputs, we can accept all the predicted tokens to effectively predict multiple tokens per inference forward that match the output of the base model, while rejecting all pass, accelerating wall-clock inference speeds of highly optimized those that don't. In this way we can predict multiple tokens base model implementations by a factor of 2-3x. We per LLM forward pass at minimal extra cost. A deeper explore these initial results and describe next steps for further explanation of speculative decoding can be found in [3, 6].
SudokuSens: Enhancing Deep Learning Robustness for IoT Sensing Applications using a Generative Approach
Wang, Tianshi, Li, Jinyang, Wang, Ruijie, Kara, Denizhan, Liu, Shengzhong, Wertheimer, Davis, Viros-i-Martin, Antoni, Ganti, Raghu, Srivatsa, Mudhakar, Abdelzaher, Tarek
This paper introduces SudokuSens, a generative framework for automated generation of training data in machine-learning-based Internet-of-Things (IoT) applications, such that the generated synthetic data mimic experimental configurations not encountered during actual sensor data collection. The framework improves the robustness of resulting deep learning models, and is intended for IoT applications where data collection is expensive. The work is motivated by the fact that IoT time-series data entangle the signatures of observed objects with the confounding intrinsic properties of the surrounding environment and the dynamic environmental disturbances experienced. To incorporate sufficient diversity into the IoT training data, one therefore needs to consider a combinatorial explosion of training cases that are multiplicative in the number of objects considered and the possible environmental conditions in which such objects may be encountered. Our framework substantially reduces these multiplicative training needs. To decouple object signatures from environmental conditions, we employ a Conditional Variational Autoencoder (CVAE) that allows us to reduce data collection needs from multiplicative to (nearly) linear, while synthetically generating (data for) the missing conditions. To obtain robustness with respect to dynamic disturbances, a session-aware temporal contrastive learning approach is taken. Integrating the aforementioned two approaches, SudokuSens significantly improves the robustness of deep learning for IoT applications. We explore the degree to which SudokuSens benefits downstream inference tasks in different data sets and discuss conditions under which the approach is particularly effective.
TP-Aware Dequantization
Hoque, Adnan, Srivatsa, Mudhakar, Yang, Chih-Chieh, Ganti, Raghu
Given the recent advancement of LLMs, deployment optimizations are becoming more crucial as the size of state-ofthe-art LLMs increase in scale. As these these models continue to grow, so does the need to optimize the increasingly parallel and increasingly distributed workload requirements of modern-day deep learning inference. Strategies like GPTQ [1] and Tensor Parallel (TP) [4] are hence essential in achieving high-throughput performance. Our method is motivated by several key properties of GPTQ, TP and General Matrix Multiplication (GEMM). We build on these existing methods and present a key innovation that helps maximize memory throughput and reduce latency. Our method shows up to a 1.81x speedup on Llama-70B and up to a 1.78x speedup on Granite-20B MLP layer problem sizes. We achieve this by reducing global communication and enforcing data locality.
Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition
Hoque, Adnan, Wright, Less, Yang, Jamie, Srivatsa, Mudhakar, Ganti, Raghu
We propose an implementation of an efficient fused matrix multiplication kernel for W4A16 quantized inference, where we perform dequantization and GEMM in a fused kernel using a SplitK work decomposition. Our implementation shows improvement for the type of skinny matrix-matrix multiplications found in foundation model inference workloads. In particular, this paper surveys the type of matrix multiplication between a skinny activation matrix and a square weight matrix. Our results show an average of 65% speed improvement on A100, and an average of 124% speed improvement on H100 (with a peak of 295%) for a range of matrix dimensions including those found in a llama-style model, where m < n = k.
State Action Separable Reinforcement Learning
Zhang, Ziyao, Ma, Liang, Leung, Kin K., Poularakis, Konstantinos, Srivatsa, Mudhakar
Reinforcement Learning (RL) based methods have seen their paramount successes in solving serial decision-making and control problems in recent years. For conventional RL formulations, Markov Decision Process (MDP) and state-action-value function are the basis for the problem modeling and policy evaluation. However, several challenging issues still remain. Among most cited issues, the enormity of state/action space is an important factor that causes inefficiency in accurately approximating the state-action-value function. We observe that although actions directly define the agents' behaviors, for many problems the next state after a state transition matters more than the action taken, in determining the return of such a state transition. In this regard, we propose a new learning paradigm, State Action Separable Reinforcement Learning (sasRL), wherein the action space is decoupled from the value function learning process for higher efficiency. Then, a light-weight transition model is learned to assist the agent to determine the action that triggers the associated state transition. In addition, our convergence analysis reveals that under certain conditions, the convergence time of sasRL is $O(T^{1/k})$, where $T$ is the convergence time for updating the value function in the MDP-based formulation and $k$ is a weighting factor. Experiments on several gaming scenarios show that sasRL outperforms state-of-the-art MDP-based RL algorithms by up to $75\%$.
neuralRank: Searching and ranking ANN-based model repositories
Desai, Nirmit, Chu, Linsong, Ganti, Raghu K., Stein, Sebastian, Srivatsa, Mudhakar
Widespread applications of deep learning have led to a plethora of pre-trained neural network models for common tasks. Such models are often adapted from other models via transfer learning. The models may have varying training sets, training algorithms, network architectures, and hyper-parameters. For a given application, what isthe most suitable model in a model repository? This is a critical question for practical deployments but it has not received much attention. This paper introduces the novel problem of searching and ranking models based on suitability relative to a target dataset and proposes a ranking algorithm called \textit{neuralRank}. The key idea behind this algorithm is to base model suitability on the discriminating power of a model, using a novel metric to measure it. With experimental results on the MNIST, Fashion, and CIFAR10 datasets, we demonstrate that (1) neuralRank is independent of the domain, the training set, or the network architecture and (2) that the models ranked highly by neuralRank ranking tend to have higher model accuracy in practice.