GPU Performance Portability needs Autotuning
Ringlein, Burkhard, Parnell, Thomas, Stoica, Radu
–arXiv.org Artificial Intelligence
As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on performance-critical LLM kernels, we demonstrate that this approach explores up to 15× more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70% and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors. Large Language Models (LLMs) have evolved dramatically in the past years. Besides the improvement in model architectures and training procedures, there have been many innovations in optimizing LLM applications for modern hardware ([1]–[4]).
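The core loop the abstract describes, enumerating candidate kernel parameter configurations, compiling/running each, and keeping the fastest, can be sketched in a few lines. This is a minimal illustration only, not the paper's implementation; the `blocked_sum` "kernel" and its `block_size` parameter are hypothetical stand-ins for a real GPU kernel and its tile/block-size knobs.

```python
import time

def autotune(kernel, configs, *args, repeats=3):
    """Exhaustively benchmark `kernel` over candidate parameter
    configurations and return the fastest one, a toy stand-in for a
    JIT-compile-and-autotune loop."""
    best_cfg, best_time = None, float("inf")
    for cfg in configs:
        # Warm-up run (stands in for one-time JIT compilation cost).
        kernel(*args, **cfg)
        t0 = time.perf_counter()
        for _ in range(repeats):
            kernel(*args, **cfg)
        elapsed = (time.perf_counter() - t0) / repeats
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time

def blocked_sum(data, block_size=64):
    """Hypothetical 'kernel': sum a list in chunks of `block_size`."""
    total = 0
    for i in range(0, len(data), block_size):
        total += sum(data[i:i + block_size])
    return total

# The search space is just an enumeration of parameter dictionaries;
# a real autotuner would sweep block sizes, pipeline stages, etc.
search_space = [{"block_size": b} for b in (16, 64, 256, 1024)]
best, avg_seconds = autotune(blocked_sum, search_space, list(range(100_000)))
print(best)
```

On a real GPU stack the timing step would launch a JIT-compiled kernel variant per configuration and synchronize the device before reading the clock; the selection logic, however, stays exactly this simple.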
Jul-18-2025