GPU Performance Portability needs Autotuning

Burkhard Ringlein, Thomas Parnell, Radu Stoica

arXiv.org Artificial Intelligence 

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on performance-critical LLM kernels, we demonstrate that this approach explores up to 15× more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70× and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.

Large Language Models (LLMs) have evolved dramatically in recent years. Besides improvements in model architectures and training procedures, there have been many innovations in optimizing LLM applications for modern hardware [1]-[4].
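The core idea, combining a JIT-compiled kernel with a search over its tuning parameters, can be illustrated with a minimal sketch. The helper below is hypothetical (not the paper's implementation, and not a real GPU framework API): it exhaustively benchmarks a kernel over a set of parameter configurations and keeps the fastest, which is the same selection loop that frameworks like Triton's autotuner perform over compiled GPU kernel variants. A CPU function with a `BLOCK_SIZE` parameter stands in for a tiled GPU kernel so the example runs anywhere.

```python
import time

def autotune(kernel, configs, *args, repeats=3):
    """Illustrative exhaustive autotuner: benchmark every parameter
    configuration and return the fastest one.

    Real JIT autotuners (e.g. Triton's) additionally compile a
    specialized kernel per config and cache the winner per input shape;
    this sketch only models the timing-and-selection step.
    """
    best_cfg, best_time = None, float("inf")
    for cfg in configs:
        # Take the minimum over several runs to reduce timing noise.
        elapsed = min(_time_once(kernel, cfg, args) for _ in range(repeats))
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time

def _time_once(kernel, cfg, args):
    start = time.perf_counter()
    kernel(*args, **cfg)
    return time.perf_counter() - start

# Stand-in "kernel": blocked summation on the CPU. BLOCK_SIZE plays the
# role of a GPU tile-size parameter (illustrative only).
def blocked_sum(data, BLOCK_SIZE):
    total = 0
    for i in range(0, len(data), BLOCK_SIZE):
        total += sum(data[i:i + BLOCK_SIZE])
    return total

# Hypothetical search space over the one tunable parameter.
search_space = [{"BLOCK_SIZE": b} for b in (16, 64, 256, 1024)]
data = list(range(10_000))
best, _ = autotune(blocked_sum, search_space, data)
```

In a real deployment the search space spans many dimensions (tile sizes, number of warps, pipeline stages, and so on), which is where the configuration counts cited in the abstract come from; the winning configuration is then reused for subsequent calls with the same shapes.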
