How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs
Cao, Jialun, Chan, Yuk-Kit, Ling, Zixuan, Wang, Wenxuan, Li, Shuqing, Liu, Mingwei, Qiao, Ruixi, Han, Yuting, Wang, Chaozheng, Yu, Boxi, He, Pinjia, Wang, Shuai, Zheng, Zibin, Lyu, Michael R., Cheung, Shing-Chi
arXiv.org Artificial Intelligence
Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios. We refer to them as code-related benchmarks. However, there are no systematic guidelines for developing such benchmarks to ensure their quality, reliability, and reproducibility. We propose How2Bench, a 55-criteria checklist that serves as a comprehensive set of guidelines for developing code-related benchmarks. Using How2Bench, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks took no measures for data quality assurance; over 10% were not open-sourced or were only partially open-sourced. Many highly cited benchmarks have loopholes, including duplicated samples, incorrect reference code, tests, or prompts, and unremoved sensitive or confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.
Feb-17-2025
- Country:
- Africa
- Ethiopia > Addis Ababa
- Addis Ababa (0.04)
- Rwanda > Kigali
- Kigali (0.04)
- Asia
- China
- Middle East > Jordan (0.04)
- Singapore (0.04)
- South Korea > Seoul
- Seoul (0.04)
- Taiwan > Taiwan Province
- Taipei (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- United Kingdom (0.04)
- Finland > Lapland
- Rovaniemi (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Sweden
- Stockholm > Stockholm (0.04)
- Vaestra Goetaland > Gothenburg (0.04)
- Switzerland > Basel-City
- Basel (0.04)
- Italy
- Piedmont > Turin Province
- Turin (0.04)
- Tuscany
- Florence (0.04)
- Pisa Province > Pisa (0.04)
- France
- Auvergne-Rhône-Alpes > Lyon
- Lyon (0.04)
- Occitanie > Hérault
- Montpellier (0.04)
- Provence-Alpes-Côte d'Azur > Bouches-du-Rhône
- Marseille (0.04)
- Greece > Attica
- Athens (0.04)
- Spain
- Catalonia > Barcelona Province
- Barcelona (0.04)
- Galicia > Madrid (0.04)
- Germany > Berlin (0.04)
- Austria > Vienna (0.14)
- North America
- Canada
- British Columbia
- Ontario
- Middlesex County > London (0.04)
- Toronto (0.04)
- Quebec > Montreal (0.04)
- Dominican Republic (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- New York > New York County
- New York City (0.04)
- California
- San Diego County > San Diego (0.04)
- San Francisco County > San Francisco (0.28)
- Santa Clara County > San Jose (0.04)
- Pennsylvania
- Allegheny County > Pittsburgh (0.04)
- Philadelphia County > Philadelphia (0.04)
- Washington > King County
- Seattle (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Hawaii > Honolulu County
- Honolulu (0.04)
- Arizona > Maricopa County
- Phoenix (0.04)
- Florida
- Miami-Dade County > Miami (0.04)
- Orange County > Orlando (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Oceania > Australia
- Genre:
- Research Report
- Experimental Study (0.45)
- New Finding (0.46)
- Industry:
- Education (1.00)
- Information Technology > Security & Privacy (0.67)
- Technology: