SciCode: A Research Coding Benchmark Curated by Scientists

Neural Information Processing Systems 

The development of evaluations in tandem with language models (LMs) has substantially contributed to the rapid advancement of these models [30, 12, 8, 26, 83, 28, 74]. As LMs now surpass the performance of most humans other than domain experts, evaluating them has become increasingly challenging. Many established benchmarks struggle to keep pace with advances in LM performance and have quickly become saturated [93, 15, 72, 59], leading to discrepancies between models' perceived and actual capabilities [37]. As a consequence, researchers are developing challenging synthetic benchmarks, often involving models themselves in constructing the evaluation instances. For example, some works subsample instances from existing benchmarks that current models cannot solve [95, 84], or augment existing benchmarks to construct more challenging evaluations [22, 45, 50]. However, it is unclear whether such efforts accurately reflect real-world applications and models' performance in practical scenarios. Realistic, high-quality, and challenging evaluations are crucial for the continued advancement of LMs.