RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback

Miao, Chunyu, Zou, Henry Peng, Li, Yangning, Chen, Yankai, Wang, Yibo, Wang, Fangxin, Li, Yifan, Yang, Wooseong, He, Bowei, Zhang, Xinni, Yu, Dianzhi, Yang, Hanchen, Nguyen, Hoang H, Zhou, Yue, Yang, Jie, Guo, Jizhou, Fan, Wenzhe, Yeh, Chin-Yuan, Meng, Panpan, Fang, Liancheng, Qi, Jinhu, Huang, Wei-Chieh, Gu, Zhengyao, Han, Yuwei, He, Langzhou, Yang, Yuyao, Li, Yinghui, Zheng, Hai-Tao, Liu, Xue, King, Irwin, Yu, Philip S.

Oct-27-2025–arXiv.org Artificial Intelligence

Large language models (LLMs) have been increasingly adopted across the scientific research pipeline, assisting with tasks from ideation to writing (Zhang et al., 2025; Si et al., 2024). However, generating correct and executable research code remains difficult, not only because it requires long-range reasoning and robust verification (Padigela et al., 2025; Starace et al., 2025; Zhu et al., 2025), but also because the input contexts in research settings are often complex, indirect, and noisy. Research papers describe methods through high-level narratives, mathematical formulas, and domain-specific conventions, leaving many implementation details implicit. As a result, translating these fragmented and underspecified descriptions into functional code remains a fundamental challenge for current LLMs (Li et al., 2025b;a). Existing benchmarks for research code generation (Zheng et al., 2023; Sun et al., 2023; Toledo et al., 2025; Hua et al., 2025) primarily evaluate models in a non-interactive setting, where they are expected to produce correct code in a single response. This design neglects the crucial role of human feedback in realistic workflows: in particular, users often cannot fully specify their requirements in one shot.

Country:
- Asia > China (0.28)
- North America (0.28)

Genre:
- Research Report > New Finding (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
