Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
Jiayi Kuang, Yinghui Li, Xin Zhang, Yangning Li, Di Yin, Xing Sun, Ying Shen, Philip S. Yu
arXiv.org Artificial Intelligence
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and the scarcity of large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and action execution for the final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.
Oct-30-2025