Wang, Liran
MdEval: Massively Multilingual Code Debugging
Liu, Shukai, Chai, Linzheng, Yang, Jian, Shi, Jiajun, Zhu, He, Wang, Liran, Jin, Ke, Zhang, Wei, Zhu, Hualei, Guo, Shuyue, Sun, Tao, Liu, Jiaheng, Duan, Yunlong, Hao, Yu, Yang, Liqun, Niu, Guanglin, Zhang, Ge, Li, Zhoujun
Code large language models (LLMs) have made significant progress in code debugging by directly generating the correct code based on the buggy code snippet. Programming benchmarks, typically consisting of buggy code snippet and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks primarily focus on Python and are often limited in terms of language diversity (e.g., DebugBench and DebugEval). To advance the field of multilingual debugging with LLMs, we propose the first massively multilingual debugging benchmark, which includes 3.6K test samples of 18 programming languages and covers the automated program repair (APR) task, the code review (CR) task, and the bug identification (BI) task. Further, we introduce the debugging instruction corpora MDEVAL-INSTRUCT by injecting bugs into the correct multilingual queries and solutions (xDebugGen). Further, a multilingual debugger xDebugCoder trained on MDEVAL-INSTRUCT as a strong baseline specifically to handle the bugs of a wide range of programming languages (e.g. "Missing Mut" in language Rust and "Misused Macro Definition" in language C). Our extensive experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs (e.g., GPT and Claude series), highlighting huge room for improvement in multilingual code debugging scenarios.
In-Context Code-Text Learning for Bimodal Software Engineering
Tang, Xunzhu, Wang, Liran, Liu, Yonghui, Chai, Linzheng, Yang, Jian, Li, Zhoujun, Tian, Haoye, Klein, Jacques, Bissyande, Tegawende F.
Bimodal software analysis initially appeared to be within reach with the advent of large language models. Unfortunately, the complex interplay of natural language text and code in software engineering, presents unique challenges that prevent pretrained models to generalize to a variety of tasks. We postulate that in-context learning for the code-text bimodality is a promising avenue. This paper thus introduces a comprehensive study of in-context code-text learning, focusing on leveraging pretrained CodeLLAMA models. We consider a diverse dataset encompassing 23 software engineering tasks, which we transform in an in-context learning format. To effectively extract informative features, we propose a configurable prompt template. Our proposed pipeline, InCTRL, then unifies prompt learning across various software engineering tasks. Extensive evaluation on the study datasets demonstrates the superiority of INCTRL-models in few-shot performance, surpassing state-of-the-art models including the support model, CodeLLAMA. Typically, we observe that applied to the CodeLLAMA model, INCTRL brings improvements in terms of precision (at least about 12\%) and recall (up to 93.88\%) on various tasks. For example, on the task of program repair, INCTRL improves the BLEU score of CodeLLAMA by 85 points, while for clone detection, INCTRL achieves an improvement of 69 percentage points. Moreover, INCTRL-models offer state-of-the-art performance when using retrieval-augmented generation on individual downstream tasks. Finally, we qualitatively analyze the benefits of INCTRL over CodeLLAMA and open-source all models for broader impact. We make our code and dataset publicly available at: \begin{center} {\url{https://anonymous.4open.science/r/inctrl-B65B}} \end{center}