SciCode: A Research Coding Benchmark Curated by Scientists

Neural Information Processing Systems 

The development of evaluations in tandem with language models (LMs) has substantially contributed to the rapid advancement of these models [30, 12, 8, 26, 83, 28, 74]. As LMs now surpass the performance of most humans other than domain experts, evaluating them has become increasingly challenging. Many established benchmarks struggle to keep pace with advances in LM performance and have quickly become saturated [93, 15, 72, 59], leading to discrepancies between models' perceived and actual capabilities [37]. As a consequence, researchers are developing challenging synthetic benchmarks, often involving models themselves in constructing the evaluation instances. For example, some works subsample instances from existing benchmarks that current models cannot solve [95, 84], or augment existing benchmarks to construct more challenging evaluations [22, 45, 50]. However, it is unclear whether such efforts accurately reflect real-world applications and models' performance in practical scenarios. Realistic, high-quality, and challenging evaluations are crucial for the continued advancement of LMs.