pragma
Towards a Comprehensive Benchmark for High-Level Synthesis Targeted to FPGAs
High-level synthesis (HLS) aims to raise the abstraction layer in hardware design, enabling the design of domain-specific accelerators (DSAs) like field-programmable gate arrays (FPGAs) using C/C++ instead of hardware description languages (HDLs). Compiler directives in the form of pragmas play a crucial role in modifying the microarchitecture within the HLS framework. However, the space of possible microarchitectures grows exponentially with the number of pragmas.
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization
Lei, Kelun, Yang, Hailong, Zhang, Huaitao, You, Xin, Zhang, Kaige, Luan, Zhongzhi, Liu, Yi, Qian, Depei
Abstract--Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel generation, yet most existing systems rely solely on correctness or execution time feedback, lacking the ability to reason about low-level performance bottlenecks. In this paper, we introduce PRAGMA, a profile-guided AI kernel generation framework that integrates execution feedback and fine-grained hardware profiling into the reasoning loop. PRAGMA enables LLMs to identify performance bottlenecks, preserve historical best versions, and iteratively refine code quality. Results show that PRAGMA consistently outperforms baseline N-PRAGMA without profiling enabled and achieves 2.81 and 2.30 averaged speedups against T orch on CPU and GPU platforms, respectively. Optimizing computational kernels is fundamental to achieving high performance in modern AI and HPC systems. Traditionally, reaching near-peak efficiency has required extensive manual tuning and deep expertise in architecture-specific optimization, making the development and maintenance of high-performance kernels both labor-intensive and error-prone.
VibeCodeHPC: An Agent-Based Iterative Prompting Auto-Tuner for HPC Code Generation Using LLMs
Hayashi, Shun-ichiro, Morita, Koki, Mukunoki, Daichi, Hoshino, Tetsuya, Katagiri, Takahiro
We propose VibeCodeHPC, an automatic tuning system for HPC programs based on multi-agent LLMs for code generation. VibeCodeHPC tunes programs through multi-agent role allocation and iterative prompt refinement. We describe the system configuration with four roles: Project Manager (PM), System Engineer (SE), Programmer (PG), and Continuous Delivery (CD). We introduce dynamic agent deployment and activity monitoring functions to facilitate effective multi-agent collaboration. In our case study, we convert and optimize CPU-based matrix-matrix multiplication code written in C to GPU code using CUDA. The multi-agent configuration of VibeCodeHPC achieved higher-quality code generation per unit time compared to a solo-agent configuration. Additionally, the dynamic agent deployment and activity monitoring capabilities facilitated more effective identification of requirement violations and other issues.
ACCeLLiuM: Supervised Fine-Tuning for Automated OpenACC Pragma Generation
Jhaveri, Samyak, Klotzmann, Vanessa, Lopes, Crista
The increasing ubiquity of GPUs is accompanied by the increasing complexity of their hardware and parallel programming frameworks. Directive-based parallel programming standards like OpenACC simplify GPU programming to some extent by abstracting away low-level complexities, but a fair amount of expertise is still required in order to use those directives effectively. We introduce ACCeLLiuM, two open weights Large Language Models specifically fine-tuned for generating expert OpenACC directives for data-parallel loops, along with the supervised fine-tuning dataset that was used to train them. The ACCeLLiuM SFT dataset contains 4,033 OpenACC pragma-loop pairs mined from public GitHub C/C++ repositories, with 3,223 pairs for training and 810 for testing. Experimental evaluations show a pronounced performance gap in generating correct OpenACC pragmas between base LLMs and our fine-tuned versions. On the held-out test set, base LLMs fail to consistently generate valid pragmas, whereas LLMs fine-tuned on the ACCeLLiuM dataset generate valid pragmas with the correct directive type for $87\%$ of the data-parallel loops, and exact pragmas--including directives, clauses, clause order, and clause variables--for $50\%$ of the cases. Even when not exact, generated pragmas frequently incorporate the correct clauses in a different order than the ground-truth label, or include additional clauses that enable finer control over parallel execution, data movement, and concurrency, offering practical value beyond strict string-matching. By publicly releasing the code, models, and dataset as ACCeLLiuM we hope to establish a reproducible benchmark for LLM-powered OpenACC pragma generation, and lower the barrier to automated GPU offloading of serially written programs.
Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization
Lange, Robert Tjarko, Sun, Qi, Prasad, Aaditya, Faldor, Maxence, Tang, Yujin, Ha, David
Recent advances in large language models (LLMs) demonstrate their effectiveness in scaling test-time compute for software engineering tasks. However, these approaches often focus on high-level solutions, with limited attention to optimizing low-level CUDA kernel implementations. Additionally, existing kernel generation benchmarks suffer from exploitable loopholes and insufficient diversity in testing conditions, hindering true generalization assessment. To address these limitations, we introduce robust-kbench, a new benchmark for rigorous evaluation of kernel performance and correctness across varied scenarios. Furthermore, we present a comprehensive agentic framework that automates CUDA kernel discovery, verification, and optimization. This pipeline enables frontier LLMs to translate torch code to CUDA kernels and iteratively improve their runtime within our robust evaluation setting. Our sequential workflow first translates PyTorch code into equivalent CUDA kernels. It then optimizes their runtime using a novel evolutionary meta-generation procedure tailored to the CUDA ecosystem, guided by LLM-based verifiers for correctness and efficient filtering. Evaluated on robust-kbench, our approach produces CUDA kernels outperforming torch implementations for practical applications, including forward and backward passes. It can fuse operations and deploy various runtime optimization strategies. The verifier workflow accurately classifies incorrect kernels, enhancing hardware verification efficiency.
ForgeHLS: A Large-Scale, Open-Source Dataset for High-Level Synthesis
Peng, Zedong, Li, Zeju, Gao, Mingzhe, Xu, Qiang, Zhang, Chen, Zhao, Jieru
High-Level Synthesis (HLS) plays a crucial role in modern hardware design by transforming high-level code into optimized hardware implementations. However, progress in applying machine learning (ML) to HLS optimization has been hindered by a shortage of sufficiently large and diverse datasets. To bridge this gap, we introduce ForgeHLS, a large-scale, open-source dataset explicitly designed for ML-driven HLS research. ForgeHLS comprises over 400k diverse designs generated from 846 kernels covering a broad range of application domains, consuming over 200k CPU hours during dataset construction. Each kernel includes systematically automated pragma insertions (loop unrolling, pipelining, array partitioning), combined with extensive design space exploration using Bayesian optimization. Compared to existing datasets, ForgeHLS significantly enhances scale, diversity, and design coverage. We further define and evaluate representative downstream tasks in Quality of Result (QoR) prediction and automated pragma exploration, clearly demonstrating ForgeHLS utility for developing and improving ML-based HLS optimization methodologies. The dataset and code are public at https://github.com/zedong-peng/ForgeHLS.
LIFT: LLM-Based Pragma Insertion for HLS via GNN Supervised Fine-Tuning
Prakriya, Neha, Ding, Zijian, Sun, Yizhou, Cong, Jason
--FPGAs are increasingly adopted in datacenter environments for their reconfigurability and energy efficiency. High-Level Synthesis (HLS) tools have eased FPGA programming by raising the abstraction level from RTL to untimed C/C++, yet attaining high performance still demands expert knowledge and iterative manual insertion of optimization pragmas to modify the microarchitecture. T o address this challenge, we propose LIFT, a large language model (LLM)-based coding assistant for HLS that automatically generates performance-critical pragmas given a C/C++ design. On average, LIFT produces designs that improve performance by 3.52 and 2.16 than prior state-of the art AutoDSE and HARP respectively, and 66 than GPT -4o. Data center applications require high-performance, low-power, scalable, and reconfigurable hardware. With the end of Dennard's scaling [1], these requirements are becoming increasingly critical to address. FPGAs emerge as a powerful solution and in recent years have been adopted by major cloud providers such as A WS, Microsoft, and Alibaba in their servers. Despite their potential, FPGAs remain challenging to program and deploy efficiently. High-Level Synthesis (HLS) tools such as Vitis HLS [2], Merlin [3], and Intel HLS [4] aim to bridge this gap by raising the abstraction level from low-level RTL to C/C++.
Towards a Comprehensive Benchmark for High-Level Synthesis Targeted to FPGAs
High-level synthesis (HLS) aims to raise the abstraction layer in hardware design, enabling the design of domain-specific accelerators (DSAs) like field-programmable gate arrays (FPGAs) using C/C instead of hardware description languages (HDLs). Compiler directives in the form of pragmas play a crucial role in modifying the microarchitecture within the HLS framework. However, the space of possible microarchitectures grows exponentially with the number of pragmas. To accelerate this process, machine learning models have been used to predict design quality in milliseconds. However, existing open-source datasets for training such models are limited in terms of design complexity and available optimizations. It contains more complex programs with a wider range of optimization pragmas, making it a comprehensive dataset for training and evaluating design quality prediction models.