E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task
Liu, Jingyao, Huang, Chen, Guan, Zhizhao, Lei, Wenqiang, Deng, Yang
arXiv.org Artificial Intelligence
The rapid advancement in large language models (LLMs) has demonstrated significant potential in End-to-End Software Development (E2ESD). However, existing E2ESD benchmarks are limited by coarse-grained requirement specifications and unreliable evaluation protocols, hindering a true understanding of current framework capabilities. To address these limitations, we present E2EDev, a novel benchmark grounded in the principles of Behavior-Driven Development (BDD), which evaluates the capabilities of E2ESD frameworks by assessing whether the generated software meets user needs through mimicking real user interactions (Figure 1). E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). By evaluating various E2ESD frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to effectively solve these tasks, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at https://github.com/SCUNLP/E2EDev.
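To make the BDD-style evaluation described in the abstract concrete, the following is a minimal sketch of how Given/When/Then scenario lines can be bound to Python step implementations and executed against an application. It is not taken from the E2EDev codebase and does not use the Behave API: the `step` decorator, `run_scenario`, `CounterApp`, and the scenario text are all hypothetical stand-ins, written with the standard library only so that the example runs on its own.

```python
import re
from types import SimpleNamespace

# Hypothetical step registry, standing in for what a BDD framework provides.
STEPS = []

def step(pattern):
    """Register a function as the implementation of steps matching `pattern`."""
    def decorator(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return decorator

class CounterApp:
    """Toy in-memory application standing in for generated software."""
    def __init__(self):
        self.count = 0

    def click(self, button):
        if button == "increment":
            self.count += 1
        elif button == "reset":
            self.count = 0
        else:
            raise ValueError(f"unknown button: {button}")

@step(r'the counter app is open')
def open_app(context):
    context.app = CounterApp()

@step(r'the user clicks "(\w+)" (\d+) times?')
def click_button(context, button, times):
    for _ in range(int(times)):
        context.app.click(button)

@step(r'the displayed count is (\d+)')
def check_count(context, expected):
    assert context.app.count == int(expected), (context.app.count, expected)

def run_scenario(text):
    """Run each scenario line in order; return True if every step passes."""
    context = SimpleNamespace()  # shared state passed between steps
    for line in text.strip().splitlines():
        # Drop the leading Given/When/Then/And keyword before matching.
        body = re.sub(r'^\s*(Given|When|Then|And)\s+', '', line).strip()
        for pattern, func in STEPS:
            match = pattern.fullmatch(body)
            if match:
                try:
                    func(context, *match.groups())
                except AssertionError:
                    return False  # a failed check means the requirement is unmet
                break
        else:
            raise LookupError(f"no step implementation for: {line!r}")
    return True

# Hypothetical scenario for one fine-grained requirement.
SCENARIO = '''
Given the counter app is open
When the user clicks "increment" 3 times
And the user clicks "reset" 1 time
And the user clicks "increment" 2 times
Then the displayed count is 2
'''

if __name__ == "__main__":
    print("passed" if run_scenario(SCENARIO) else "failed")
```

In this sketch a requirement counts as satisfied only when every step of its scenario passes, which mirrors the idea of judging generated software by simulated user interactions rather than by inspecting its source.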
Oct-27-2025
- Country:
  - Asia
    - China > Sichuan Province (0.04)
    - Singapore (0.04)
- Genre:
  - Overview (1.00)
  - Research Report (1.00)
  - Workflow (0.93)
- Industry:
  - Information Technology (0.67)
- Technology: