Gamblin, Todd
Toward a Cohesive AI and Simulation Software Ecosystem for Scientific Innovation
Heroux, Michael A., Shende, Sameer, McInnes, Lois Curfman, Gamblin, Todd, Willenbring, James M.
ParaTools, Inc.; Sameer Shende, ParaTools, Inc.; Lois Curfman McInnes, Argonne National Laboratory; Todd Gamblin, Lawrence Livermore National Laboratory; James M. Willenbring, Sandia National Laboratories

In this document, we outline key considerations for a next-generation software stack that supports scientific applications integrating AI and modeling & simulation (ModSim). The scientific computing community needs a cohesive AI/ModSim software stack, and that stack must support binary distributions to enable emerging scientific workflows.

A Cohesive Software Stack for AI and Modeling & Simulation

To address future scientific challenges, the next-generation scientific software stack must provide a cohesive portfolio of libraries and tools that supports both AI and ModSim approaches. As scientific research becomes increasingly interdisciplinary, scientists need both toolsets to address complex, data-rich problems in domains such as climate modeling, materials discovery, and energy optimization.
Performance-Aligned LLMs for Generating Fast Code
Nichols, Daniel, Polasam, Pranav, Menon, Harshitha, Marathe, Aniruddha, Gamblin, Todd, Bhatele, Abhinav
Optimizing scientific software is a difficult task because codebases are often large and complex, and performance can depend on several factors, including the algorithm, its implementation, and the hardware. Causes of poor performance can originate from disparate sources and be difficult to diagnose. Recent years have seen a large body of work that uses large language models (LLMs) to assist in software development tasks. However, these tools are trained to model the distribution of code as text and are not specifically designed to understand the performance aspects of code. In this work, we introduce a reinforcement learning-based methodology to align the outputs of code LLMs with performance. This allows us to build upon the current code modeling capabilities of LLMs and extend them to generate better-performing code. We demonstrate that our fine-tuned model improves the expected speedup of generated code over base models on a set of benchmark tasks from 0.9 to 1.6 for serial code and from 1.9 to 4.5 for OpenMP code.
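One way to picture the performance-alignment signal described above is as a reward equal to the measured speedup of a generated candidate over a baseline implementation. The sketch below is a toy illustration under that assumption; `slow_sum`, `fast_sum`, and `speedup_reward` are hypothetical names, not the paper's training pipeline.

```c
/* Toy sketch: reward = baseline_time / candidate_time (speedup).
   A reinforcement-learning loop could use this value to score a
   model-generated candidate against a reference implementation. */
#include <time.h>

/* Baseline: naive loop (volatile keeps the compiler from collapsing it). */
static long long slow_sum(long long n) {
    volatile long long total = 0;
    for (long long i = 0; i < n; i++) total += i;
    return total;
}

/* Candidate "generated" code: closed-form sum. */
static long long fast_sum(long long n) {
    return n * (n - 1) / 2;
}

/* Wall-clock one call to fn(n), discarding the result. */
static double time_call(long long (*fn)(long long), long long n) {
    clock_t start = clock();
    volatile long long sink = fn(n);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Reward signal: measured speedup of the candidate over the baseline. */
double speedup_reward(long long n) {
    double t_base = time_call(slow_sum, n);
    double t_cand = time_call(fast_sum, n);
    if (t_cand <= 0.0) t_cand = 1e-9;   /* guard against clock granularity */
    return t_base / t_cand;
}
```

A reward above 1.0 means the candidate outran the baseline; rewards below 1.0 would penalize regressions during fine-tuning.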
Modeling Parallel Programs using Large Language Models
Nichols, Daniel, Marathe, Aniruddha, Menon, Harshitha, Gamblin, Todd, Bhatele, Abhinav
Parallel software codes in high performance computing (HPC) continue to grow in complexity and scale as we enter the exascale era. A diverse set of emerging hardware and programming paradigms makes developing, optimizing, and maintaining parallel software burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. Such tools can perform complex and/or repetitive tasks for developers, increasing their productivity and decreasing the chance of error. So far, such tools for code development and performance analysis have been limited in the complexity of tasks they can perform. However, with recent advancements in language modeling and the wealth of code-related data now available online, these tools have started to utilize predictive language models to automate more complex tasks. In this paper, we show how large language models (LLMs) can be applied to tasks specific to high performance and scientific codes. We train LLMs using code and performance data that are specific to parallel codes. We compare several recent LLMs on HPC-related tasks and introduce a new model, HPC-Coder, trained on parallel code. In our experiments, we show that this model can auto-complete HPC functions where general models cannot, decorate for loops with OpenMP pragmas, and model performance changes in two scientific application repositories.
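The pragma-decoration task can be illustrated with a toy kernel: the model receives the plain loop as input and emits the `#pragma omp parallel for` line above it. This example is hypothetical, not one of the paper's benchmarks.

```c
#include <stddef.h>

/* A loop a model might be asked to decorate. The pragma below is the
   model's output: the iterations are independent, so the loop is safe
   to parallelize. If OpenMP is not enabled, the pragma is ignored and
   the code still compiles and runs serially. */
void vec_add(size_t n, const double *a, const double *b, double *c) {
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        c[i] = a[i] + b[i];
}
```

Because an ignored pragma leaves behavior unchanged, a correct decoration can only help performance, while an incorrect one (e.g., on a loop with a carried dependence) silently changes results, which is why validating model suggestions matters.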
Machine Learning-Driven Adaptive OpenMP For Portable Performance on Heterogeneous Systems
Georgakoudis, Giorgis, Parasyris, Konstantinos, Liao, Chunhua, Beckingsale, David, Gamblin, Todd, de Supinski, Bronis
The end of Dennard scaling -- which had allowed processor clock frequencies to increase with each generation of transistor miniaturization -- in conjunction with the continuation of Moore's law -- which predicts that the number of CMOS transistors on a microchip doubles every two years -- shifted the technology trend toward parallel architectures. In the early 2000s, parallel computer system architectures focused on multi-core CPUs. Later, the introduction of the GPGPU paradigm pivoted technology trends to heterogeneous systems composed of both multi-core CPUs and GPUs. This heterogeneity unveiled the challenge of software performance portability: achieving equivalent performance from a single application implementation regardless of the underlying hardware architecture. Programming models such as OmpSs [9], OpenMP, Kokkos [10], and RAJA [15] provide abstractions that hide the vendor-specific interfaces required to develop applications on these heterogeneous parallel architectures and offer unified interfaces to express parallelism. Although these programming models provide a single, convenient layer for implementing portable code, the performance of the same application can vary when executed on different architectures and systems. Thus, these programming models efficiently express portable code, but application performance portability across different heterogeneous systems is not guaranteed. For example, HPC programmers have found that a single version of source code, with an associated static definition of ex…

arXiv:2303.08873v1
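A minimal sketch of the single-source portability this abstract describes, using an OpenMP `target` offload directive: the same loop can run on a GPU when offload support is available and falls back to the host (or to serial execution without OpenMP) otherwise. The `saxpy` kernel is an illustrative example, not code from the paper's adaptive machinery.

```c
/* Single-source portable kernel. With an offload-capable compiler and a
   GPU present, the directive maps the arrays to the device and runs the
   loop there; on a CPU-only system the OpenMP runtime executes it on the
   host, and without OpenMP the pragma is ignored entirely. */
void saxpy(int n, float alpha, const float *x, float *y) {
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

The portability gap the abstract points to is that although this one source compiles everywhere, the best choice of execution parameters (host vs. device, team and thread counts) differs per system, which is what motivates adaptive, machine-learning-driven selection.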