RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback
Miao, Chunyu, Zou, Henry Peng, Li, Yangning, Chen, Yankai, Wang, Yibo, Wang, Fangxin, Li, Yifan, Yang, Wooseong, He, Bowei, Zhang, Xinni, Yu, Dianzhi, Yang, Hanchen, Nguyen, Hoang H, Zhou, Yue, Yang, Jie, Guo, Jizhou, Fan, Wenzhe, Yeh, Chin-Yuan, Meng, Panpan, Fang, Liancheng, Qi, Jinhu, Huang, Wei-Chieh, Gu, Zhengyao, Han, Yuwei, He, Langzhou, Yang, Yuyao, Li, Yinghui, Zheng, Hai-Tao, Liu, Xue, King, Irwin, Yu, Philip S.
arXiv.org Artificial Intelligence
Large language models (LLMs) have been increasingly adopted across the scientific research pipeline, assisting with tasks from ideation to writing (Zhang et al., 2025; Si et al., 2024). However, generating correct and executable research code remains difficult, not only because it requires long-range reasoning and robust verification (Padigela et al., 2025; Starace et al., 2025; Zhu et al., 2025), but also because the input contexts in research settings are often complex, indirect, and noisy. Research papers describe methods through high-level narratives, mathematical formulas, and domain-specific conventions, leaving many implementation details implicit. As a result, translating these fragmented and underspecified descriptions into functional code remains a fundamental challenge for current LLMs (Li et al., 2025b;a). Existing benchmarks for research code generation (Zheng et al., 2023; Sun et al., 2023; Toledo et al., 2025; Hua et al., 2025) primarily evaluate models in a non-interactive setting, where they are expected to produce correct code in a single response. This design neglects the crucial role of human feedback in realistic workflows, where users often cannot fully specify their requirements in one shot.
Oct-27-2025