GPU Performance Portability needs Autotuning

Burkhard Ringlein, Thomas Parnell, Radu Stoica

arXiv.org Artificial Intelligence 

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on performance-critical LLM kernels, we demonstrate that this approach explores up to 15× more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70× and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.

Large Language Models (LLMs) have evolved dramatically in recent years. Besides improvements in model architectures and training procedures, there have been many innovations in optimizing LLM applications for modern hardware [1]-[4].
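The core idea, combining a JIT-compiled kernel with a search over its tuning parameters, can be illustrated with a minimal sketch. The helper below is hypothetical (not the paper's implementation, and not a real GPU framework API): it exhaustively benchmarks a kernel over a set of parameter configurations and keeps the fastest, which is the same selection loop that frameworks like Triton's autotuner perform over compiled GPU kernel variants. A CPU function with a `BLOCK_SIZE` parameter stands in for a tiled GPU kernel so the example runs anywhere.

```python
import time

def autotune(kernel, configs, *args, repeats=3):
    """Illustrative exhaustive autotuner: benchmark every parameter
    configuration and return the fastest one.

    Real JIT autotuners (e.g. Triton's) additionally compile a
    specialized kernel per config and cache the winner per input shape;
    this sketch only models the timing-and-selection step.
    """
    best_cfg, best_time = None, float("inf")
    for cfg in configs:
        # Take the minimum over several runs to reduce timing noise.
        elapsed = min(_time_once(kernel, cfg, args) for _ in range(repeats))
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time

def _time_once(kernel, cfg, args):
    start = time.perf_counter()
    kernel(*args, **cfg)
    return time.perf_counter() - start

# Stand-in "kernel": blocked summation on the CPU. BLOCK_SIZE plays the
# role of a GPU tile-size parameter (illustrative only).
def blocked_sum(data, BLOCK_SIZE):
    total = 0
    for i in range(0, len(data), BLOCK_SIZE):
        total += sum(data[i:i + BLOCK_SIZE])
    return total

# Hypothetical search space over the one tunable parameter.
search_space = [{"BLOCK_SIZE": b} for b in (16, 64, 256, 1024)]
data = list(range(10_000))
best, _ = autotune(blocked_sum, search_space, data)
```

In a real deployment the search space spans many dimensions (tile sizes, number of warps, pipeline stages, and so on), which is where the configuration counts cited in the abstract come from; the winning configuration is then reused for subsequent calls with the same shapes.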
