Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation
Chen, Le, Xu, Nuo, Chen, Winson, Lei, Bin, Lin, Pei-Hung, Zhou, Dunzhi, Thakur, Rajeev, Ding, Caiwen, Jannesari, Ali, Liao, Chunhua
arXiv.org Artificial Intelligence
Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran -> C++ and C++ -> CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show this data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.
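The abstract describes a dual-LLM Questioner-Solver design in which compiler and runtime feedback drives iterative translation refinement while a multi-turn dialogue is recorded. The sketch below is a hypothetical illustration of that loop, not the paper's actual pipeline: the function names, the feedback format, and the toy stand-in callables are all assumptions made for clarity.

```python
# Hypothetical sketch of a Questioner-Solver refinement loop with external
# compiler feedback. All names here are illustrative, not the paper's API.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class DialogueTurn:
    role: str      # "solver" or "questioner"
    content: str

def refine_translation(
    source_code: str,
    solver: Callable[[str, str], str],                  # (source, feedback) -> candidate
    questioner: Callable[[str, str], str],              # (candidate, log) -> probing question
    compile_check: Callable[[str], Tuple[bool, str]],   # candidate -> (ok, compiler log)
    max_turns: int = 3,
) -> Tuple[str, List[DialogueTurn]]:
    """Iteratively refine a translation, recording the multi-turn dialogue."""
    dialogue: List[DialogueTurn] = []
    feedback = ""
    candidate = ""
    for _ in range(max_turns):
        candidate = solver(source_code, feedback)
        dialogue.append(DialogueTurn("solver", candidate))
        ok, log = compile_check(candidate)
        if ok:
            break
        # External knowledge (the compiler log) enters via the Questioner.
        feedback = questioner(candidate, log)
        dialogue.append(DialogueTurn("questioner", feedback))
    return candidate, dialogue

# Toy stand-ins: a "solver" that repairs its output once it sees feedback, and
# a "compiler" that rejects candidates lacking an entry point.
def toy_solver(src: str, feedback: str) -> str:
    return "int main() { return 0; }" if feedback else "return 0;"

def toy_questioner(candidate: str, log: str) -> str:
    return f"The compiler reported: {log}. Should the code define an entry point?"

def toy_compile_check(candidate: str) -> Tuple[bool, str]:
    ok = "int main" in candidate
    return ok, "" if ok else "error: expected a declaration"

final, turns = refine_translation(
    "program p\nend program p", toy_solver, toy_questioner, toy_compile_check
)
```

Each recorded `DialogueTurn` list would correspond to one of the multi-turn dialogues the abstract says the pipeline emits alongside the verified code pair.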
Dec-4-2025