Code generation benchmark


AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Chou, Jason, Liu, Ao, Deng, Yuchi, Zeng, Zhiying, Zhang, Tao, Zhu, Haotian, Cai, Jianwei, Mao, Yue, Zhang, Chenchen, Tan, Lingyun, Xu, Ziyan, Zhai, Bohui, Liu, Hengyi, Zhu, Speed, Zhou, Wiggin, Lian, Fengzong

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities, these benchmarks face several critical limitations. First, they often rely on manual annotations, which are time-consuming and difficult to scale across different programming languages and problem complexities. Second, most existing benchmarks focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and uneven language distribution. To address these challenges, we propose AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining test outputs through a multilingual sandbox, and it achieves high data quality through reverse-order problem generation and multiple filtering steps. Using this novel method, we introduce AutoCodeBench, a large-scale code generation benchmark comprising 3,920 problems evenly distributed across 20 programming languages. It is specifically designed to evaluate LLMs on challenging, diverse, and practical multilingual tasks. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks. In addition, we introduce AutoCodeBench-Complete, tailored to base models to assess their few-shot code generation capabilities. We hope the AutoCodeBench series will serve as a valuable resource and inspire the community to focus on more challenging and practical multilingual code generation scenarios.
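The test-output derivation step described above, running a reference solution on LLM-proposed inputs inside a sandbox to obtain ground-truth outputs, can be sketched roughly as follows. This is a minimal single-language illustration, not AutoCodeGen's actual protocol; the `solve` entry point and the harness shape are hypothetical assumptions, since the abstract does not specify them:

```python
import subprocess
import sys

def derive_test_outputs(solution_code, test_inputs, timeout=5):
    """Run a reference solution on each proposed input in a subprocess
    "sandbox" and record its stdout as the expected output.

    Inputs on which the solution crashes are dropped, which filters out
    malformed LLM-generated test inputs.
    """
    cases = []
    for inp in test_inputs:
        # Hypothetical harness: assume the solution defines solve(x)
        # and we print its result to capture it from stdout.
        harness = solution_code + f"\nprint(solve({inp!r}))"
        result = subprocess.run(
            [sys.executable, "-c", harness],
            capture_output=True, text=True, timeout=timeout,
        )
        if result.returncode == 0:  # keep only inputs the solution handles
            cases.append({"input": inp, "expected": result.stdout.strip()})
    return cases

solution = "def solve(s):\n    return s[::-1]"
print(derive_test_outputs(solution, ["abc", "racecar"]))
```

A real pipeline would additionally isolate the subprocess (resource limits, no network) and support one sandbox per target language, but the input-run-record loop is the core idea.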


IaC-Eval: A Code Generation Benchmark for Cloud Infrastructure-as-Code Programs

Neural Information Processing Systems

Infrastructure-as-Code (IaC), an important component of cloud computing, allows the definition of cloud infrastructure in high-level programs. However, developing IaC programs is challenging, complicated by factors that include the burgeoning complexity of the cloud ecosystem (e.g., diversity of cloud services and workloads) and the relative scarcity of IaC-specific code examples and public repositories. While large language models (LLMs) have shown promise in general code generation and could potentially aid in IaC development, no benchmarks currently exist for evaluating their ability to generate IaC code. To fill this gap, this paper presents IaC-Eval, a benchmark for LLM-based IaC program generation. IaC-Eval's dataset includes 458 human-curated scenarios covering a wide range of popular AWS services, at varying difficulty levels. Each scenario mainly comprises a natural language IaC problem description and an infrastructure intent specification.


Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval

Wu, Jiarong, Chen, Songqiang, Cao, Jialun, Lo, Hau Ching, Cheung, Shing-Chi

arXiv.org Artificial Intelligence

Existing code generation benchmarks for Large Language Models (LLMs) such as HumanEval and MBPP are designed to study LLMs' end-to-end performance, where the benchmarks feed a problem description in natural language as input and examine the generated code in specific programming languages. However, the evaluation scores revealed in this way provide little insight into the bottleneck of code generation -- whether LLMs are struggling with their problem-solving capability or their language-coding capability. To answer this question, we construct PseudoEval, a multilingual code generation benchmark that provides a solution written in pseudocode as input. By doing so, the bottleneck of code generation in various programming languages can be isolated and identified. Our study yields several interesting findings. For example, we identify that the bottleneck of LLMs in Python programming is problem-solving, while Rust struggles relatively more with language-coding. Our study also indicates that problem-solving capability may transfer across programming languages, while language-coding needs more language-specific effort, especially for undertrained programming languages. Finally, we release the pipeline for constructing PseudoEval to facilitate the extension of existing benchmarks. PseudoEval is available at: https://anonymous.4open.science/r/PseudocodeACL25-7B74.
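The isolation idea can be illustrated with a small sketch: a benchmark item pairs language-agnostic pseudocode with tests, and a harness checks a model's translation of that pseudocode into a concrete language. The item shape and `check_translation` helper below are hypothetical illustrations, not PseudoEval's actual format:

```python
# Hypothetical benchmark item: pseudocode plus input/output tests.
# Because the algorithm is already spelled out, passing the tests mainly
# exercises language-coding, not problem-solving.
pseudo_item = {
    "pseudocode": (
        "function gcd(a, b):\n"
        "    while b != 0:\n"
        "        a, b = b, a mod b\n"
        "    return a"
    ),
    "tests": [((12, 18), 6), ((7, 13), 1)],
}

def check_translation(candidate_src, item):
    """Execute a candidate translation and run it against the tests."""
    namespace = {}
    exec(candidate_src, namespace)  # trusted sketch; a real harness sandboxes this
    fn = namespace["gcd"]
    return all(fn(*args) == expect for args, expect in item["tests"])

# A correct Python translation of the pseudocode above:
candidate = (
    "def gcd(a, b):\n"
    "    while b != 0:\n"
    "        a, b = b, a % b\n"
    "    return a"
)
print(check_translation(candidate, pseudo_item))  # True
```

Comparing pass rates on such items against pass rates on the corresponding natural-language problems is what lets the benchmark attribute failures to language-coding versus problem-solving.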


Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'

Liang, Shanchao, Hu, Yiran, Jiang, Nan, Tan, Lin

arXiv.org Artificial Intelligence

Large language models (LLMs) have achieved high accuracy, i.e., more than 90% pass@1, in solving Python coding problems in HumanEval and MBPP. A natural question, then, is whether LLMs achieve code completion performance comparable to that of human developers. Unfortunately, one cannot answer this question using existing manually crafted or simple (e.g., single-line) code generation benchmarks, since such tasks fail to represent real-world software development tasks. In addition, existing benchmarks often use poor code correctness metrics, providing misleading conclusions. To address these challenges, we create REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, more than 58% of which require file-level or repository-level context information. In addition, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) among existing benchmarks. Each task in REPOCOD includes an average of 313.5 developer-written test cases for better correctness evaluation. In our evaluations of ten LLMs, none of the models achieve more than 30% pass@1 on REPOCOD, indicating the necessity of building stronger LLMs that can help developers in real-world software development. REPOCOD is available at https://github.com/lt-asset/REPOCOD
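The pass@1 numbers cited above are typically computed with the standard unbiased pass@k estimator introduced with the HumanEval benchmark: given n samples per problem of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations (c of which are
    correct) passes the tests."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k must
        # include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per problem, 3 correct: pass@1 = 3/10
print(round(pass_at_k(10, 3, 1), 6))
```

The benchmark-level score is this quantity averaged over all problems; using many samples (large n) reduces the variance of the estimate compared to naively sampling k generations.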


Large Language Models for Code Summarization

Szalontai, Balázs, Szalay, Gergő, Márton, Tamás, Sike, Anna, Pintér, Balázs, Gregorics, Tibor

arXiv.org Artificial Intelligence

The introduction of Encoder-Decoder architectures in natural language processing [26] (both recurrent [6] and Transformer-based [29]) has motivated researchers to apply them to software engineering. One important application is generating summaries of code [25, 2, 11]. A code summarization tool is useful, for example, for understanding legacy code or creating documentation. Since the spread of Large Language Models (LLMs), the working programmer has many more opportunities to use deep learning-based tools. Closed models (such as GPT-4 [21] or Gemini [27]) and open models (such as CodeLlama [24] or WizardCoder [19]) demonstrate impressive capabilities in generating source code from a task description, as well as in generating natural-language summaries of code. The main objective of this technical report is to investigate how well open-source LLMs handle source code in relation to natural language text. In particular, we discuss results for some of the most widely recognized open-source LLMs, focusing on their code summarization/explanation (code-to-text) capabilities. We also discuss the code generation (text-to-code) capabilities of these LLMs, as this is often considered their most defining capability; that is, LLMs are often ranked simply by their results on a code generation benchmark.