EFFIBENCH-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code

Neural Information Processing Systems 

Existing code generation benchmarks primarily evaluate functional correctness, with limited attention to code efficiency, and they are often restricted to a single language such as Python. To address this gap, we introduce EFFIBENCH-X, the first multi-language benchmark designed to measure the efficiency of LLM-generated code. EFFIBENCH-X supports Python, C++, Java, JavaScript, Ruby, and Golang. It comprises competitive programming tasks with human-expert solutions as efficiency baselines. Evaluating state-of-the-art LLMs on EFFIBENCH-X reveals that while models generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM-generated solutions (Qwen3-32B) achieve only around 62% of human efficiency on average, with significant language-specific variations.
