EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations

Neural Information Processing Systems 

How to evaluate Large Language Models (LLMs) in code generation remains an open question. Many benchmarks have been proposed, but they suffer from two limitations: data leakage and a lack of domain-specific evaluation. The former undermines the fairness of benchmarks, and the latter prevents practitioners from selecting superior LLMs for specific programming domains. To address these two limitations, we propose a new benchmark, EvoCodeBench, which has the following advances: Evolving data. EvoCodeBench is dynamically updated periodically (e.g., every 6 months) to avoid data leakage. This paper releases the first version, EvoCodeBench-2403, which contains 275 samples from 25 repositories.
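To make the evolving-data idea concrete, the sketch below shows one way a leakage-free evaluation split could be built: keep only samples whose source commits post-date a model's training cutoff. This is an illustrative assumption, not the actual EvoCodeBench release format; the file name "evocodebench-2403.jsonl" and the fields "commit_date" and "task_id" are hypothetical.

```python
# Minimal sketch, assuming a JSONL file with hypothetical fields
# ("task_id", "repo", "commit_date"); not the official EvoCodeBench API.
import json
from datetime import date
from pathlib import Path

CUTOFF = date(2023, 10, 1)  # hypothetical LLM training-data cutoff


def load_fresh_samples(path: Path, cutoff: date) -> list[dict]:
    """Keep only samples committed after the cutoff, so the evaluation
    set post-dates the model's training data and avoids leakage."""
    samples = []
    with path.open() as f:
        for line in f:
            sample = json.loads(line)
            committed = date.fromisoformat(sample["commit_date"])
            if committed > cutoff:
                samples.append(sample)
    return samples


if __name__ == "__main__":
    fresh = load_fresh_samples(Path("evocodebench-2403.jsonl"), CUTOFF)
    print(f"{len(fresh)} leakage-free samples")
```

Periodic re-releases (e.g., every 6 months) would simply repeat this filtering over newly committed repository code, so each benchmark version stays ahead of the training cutoffs of the models being evaluated.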
