CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation

Yin, Wenjing, Sun, Tianze, Yu, Yijiong, Fang, Jiawei, Su, Guangyao, Wang, Jiancheng, Wang, Zekun, Wang, Wei, Chen, Ran, Dai, Ziyun, Yuan, Shuai, Dong, Menghang, Luo, Peng, Cao, Dong, Lei, Da, Zhang, Yajun, Chen, Hao, Ma, Xiang, Liu, Yong, Liu, Weifeng, Xu, Yuanjian, Pei, Ji

Apr-30-2025–arXiv.org Artificial Intelligence

Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a single task and lacking a comprehensive evaluation framework that reflects real-world applications. To address these gaps, we introduce CoCo-Bench (Comprehensive Code Benchmark), designed to evaluate LLMs across four critical dimensions: code understanding, code generation, code modification, and code review. These dimensions capture essential developer needs, ensuring a more systematic and representative evaluation. CoCo-Bench includes multiple programming languages and varying task difficulties, with rigorous manual review to ensure data quality and accuracy. Empirical results show that CoCo-Bench aligns with existing benchmarks while uncovering significant variations in model performance, effectively highlighting strengths and weaknesses. By offering a holistic and objective evaluation, CoCo-Bench provides valuable insights to guide future research and technological advancements in code-oriented LLMs, establishing a reliable benchmark for the field.

benchmark, large language model, machine learning

Country:
- Asia > China (0.46)

Genre:
- Research Report > New Finding (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.48)
