Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML
Chelsea Maria John, Stepan Nassyr, Carolin Penke, Andreas Herten
Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany

Abstract
The rapid advancement of machine learning (ML) technologies has driven the development of specialized hardware accelerators designed to facilitate more efficient model training. This paper introduces the CARAML benchmark suite, which is employed to assess performance and energy consumption during the training of transformer-based large language models and computer vision models on a range of hardware accelerators, including systems from NVIDIA, AMD, and Graphcore. CARAML provides a compact, automated, extensible, and reproducible framework for assessing the performance and energy of ML workloads across various novel hardware architectures. The design and implementation of CARAML, along with a custom power measurement tool called jpwr, are discussed in detail.

Introduction
Fueled by the growing interest in training ever larger deep neural networks, such as large language models and other foundation models, the demand for hardware specialized for these workloads has grown immensely. Graphics processing units (GPUs) have evolved from their origins in computer graphics to become the primary computational engines of the AI revolution. While the central processing unit (CPU) controls a program's execution flow, it offloads compute-intensive, highly parallel tasks to the GPU (the accelerator). From its beginnings as a pioneering company, NVIDIA has emerged as the dominant player in the market as of 2024, spearheading current hardware developments. Other vendors, such as AMD and Intel, also provide GPUs aiming to accelerate model training and inference. Another promising class of AI accelerators is based on the idea of distributed local per-compute-unit memory combined with on-chip message passing, in contrast to the shared memory hierarchy typical of classical CPUs and GPUs. Performance characteristics not only vary between generations and vendors, but also depend on the node or cluster configuration in which the accelerator is embedded, including the CPU, memory, and interconnect. When evaluating and comparing these heterogeneous hardware options, e.g. for purchase decisions in an academic or industrial setting, it is not sufficient to compare hardware characteristics such as the number of cores, thermal design power (TDP), theoretical bandwidth, or peak performance in FLOP/s. Their effect on workload performance is not straightforward, and the accelerator architectures might barely be comparable. Performance data reflecting the actual intended workloads, collected on various competing systems independently of vendor interests, offer highly valuable information. Power consumption is one such important metric in this regard.
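To make the energy-measurement idea concrete, the sketch below samples accelerator power draw while a workload runs and integrates the samples into an energy estimate. It is only a minimal illustration of the general approach using NVIDIA's NVML Python bindings (pynvml); it is not the jpwr implementation, and names such as measure_energy are purely illustrative.

```python
# Minimal sketch (not jpwr): sample GPU power via NVML while a workload runs
# and integrate the samples into an energy estimate in watt-hours.
# Assumes an NVIDIA GPU and the pynvml module (pip install nvidia-ml-py).
import threading
import time

import pynvml


def measure_energy(workload, device_index=0, interval_s=0.1):
    """Run `workload()` while sampling power; return (result, energy_wh)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []  # (timestamp in seconds, power in watts)
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
            samples.append((time.monotonic(), watts))
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        result = workload()
    finally:
        stop.set()
        thread.join()
        pynvml.nvmlShutdown()

    # Trapezoidal integration of power over time gives energy in joules.
    energy_j = sum(
        0.5 * (p0 + p1) * (t1 - t0)
        for (t0, p0), (t1, p1) in zip(samples, samples[1:])
    )
    return result, energy_j / 3600.0  # joules -> watt-hours


if __name__ == "__main__":
    # Hypothetical placeholder workload; a real use would wrap a training step.
    _, energy_wh = measure_energy(lambda: time.sleep(2.0))
    print(f"Estimated energy: {energy_wh:.4f} Wh")
```

The same sampling-and-integration pattern carries over to other accelerators, with the NVML calls replaced by the vendor's respective power-query interface.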
arXiv.org Artificial Intelligence
Oct-29-2024