EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations

Neural Information Processing Systems 

How to evaluate Large Language Models (LLMs) in code generation remains an open question. Many benchmarks have been proposed, but they suffer from two limitations: data leakage and a lack of domain-specific evaluation. The former undermines the fairness of benchmarks, and the latter prevents practitioners from selecting superior LLMs for specific programming domains. To address these two limitations, we propose a new benchmark, EvoCodeBench, which has the following advances: Evolving data. EvoCodeBench is dynamically updated periodically (e.g., every 6 months) to avoid data leakage. This paper releases the first version, EvoCodeBench-2403, which contains 275 samples from 25 repositories.
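To make the evolving-data idea concrete, the sketch below shows one way a leakage-free evaluation split could be built: keep only samples whose source commits post-date a model's training cutoff. This is an illustrative assumption, not the actual EvoCodeBench release format; the file name "evocodebench-2403.jsonl" and the fields "commit_date" and "task_id" are hypothetical.

```python
# Minimal sketch, assuming a JSONL file with hypothetical fields
# ("task_id", "repo", "commit_date"); not the official EvoCodeBench API.
import json
from datetime import date
from pathlib import Path

CUTOFF = date(2023, 10, 1)  # hypothetical LLM training-data cutoff


def load_fresh_samples(path: Path, cutoff: date) -> list[dict]:
    """Keep only samples committed after the cutoff, so the evaluation
    set post-dates the model's training data and avoids leakage."""
    samples = []
    with path.open() as f:
        for line in f:
            sample = json.loads(line)
            committed = date.fromisoformat(sample["commit_date"])
            if committed > cutoff:
                samples.append(sample)
    return samples


if __name__ == "__main__":
    fresh = load_fresh_samples(Path("evocodebench-2403.jsonl"), CUTOFF)
    print(f"{len(fresh)} leakage-free samples")
```

Periodic re-releases (e.g., every 6 months) would simply repeat this filtering over newly committed repository code, so each benchmark version stays ahead of the training cutoffs of the models being evaluated.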
