REDO: Execution-Free Runtime Error Detection for COding Agents
Shou Li, Andrey Kan, Laurent Callot, Bhavana Bhasker, Muhammad Shihab Rashid, Timothy B. Esler
arXiv.org Artificial Intelligence
As LLM-based agents exhibit exceptional capabilities in addressing complex problems, there is a growing focus on developing coding agents to tackle increasingly sophisticated tasks. Despite their promising performance, these coding agents often produce programs or modifications that contain runtime errors, which can cause code failures and are difficult for static analysis tools to detect. Enhancing the ability of coding agents to statically identify such errors could significantly improve their overall performance. In this work, we introduce Execution-free Runtime Error Detection for COding Agents (REDO), a method that integrates LLMs with static analysis tools to detect runtime errors for coding agents without code execution. Additionally, we propose a benchmark task, SWE-Bench-Error-Detection (SWEDE), based on SWE-Bench Lite, to evaluate error detection in repository-level problems with complex external dependencies. Finally, through both quantitative and qualitative analyses across various error detection tasks, we demonstrate that REDO outperforms current state-of-the-art methods, achieving 11.0% higher accuracy and 9.1% higher weighted F1 score, and we provide insights into the advantages of incorporating LLMs for error detection.

Large language models (LLMs) and LLM-based agents have exhibited significant potential in code generation, code editing, and code evaluation. This progress has culminated in the development of advanced LLM-based agents (hereafter referred to as coding agents) designed to address increasingly complex tasks. For example, SWE-Bench (Jimenez et al., 2024a) presents a demanding benchmark comprising repository-level coding challenges: given a problem statement expressed in natural language, a coding agent must generate a modification patch that solves the problem within a GitHub repository. To navigate such tasks effectively, coding agents must demonstrate proficiency in three core competencies: 1) comprehension of the problem statement and retrieval of relevant code, 2) reasoning toward a functionally correct solution, and 3) generation of programs free from runtime errors such as SyntaxError, AttributeError, or TypeError.

While most coding agents focus on enhancing comprehension, retrieval, and reasoning, the systematic detection of runtime errors has received comparatively little attention. Yet ensuring that generated code is free from runtime errors is just as critical: an AttributeError, for example, can cause the modified code to fail regardless of how sound the agent's comprehension and reasoning were.
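To make the two-stage idea concrete, here is a minimal, execution-free sketch of the kind of pipeline the abstract describes: a cheap static check first (Python's `ast.parse`, which catches SyntaxError without running anything), followed by an LLM fallback for dynamic errors such as AttributeError or TypeError that static tools routinely miss. This is our illustration, not the paper's implementation; `query_llm` and `DETECTION_PROMPT` are hypothetical placeholders rather than REDO's actual prompts or tooling.

```python
import ast
import textwrap

# Hypothetical LLM client; REDO's actual prompts and model choices are
# described in the paper and not reproduced here.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client of your choice")

# Illustrative prompt only.
DETECTION_PROMPT = textwrap.dedent("""\
    You are reviewing a patch produced by a coding agent.
    Without executing the code, predict whether running it would raise a
    runtime error (e.g., AttributeError, TypeError, ValueError).
    Answer with the error class, or "NoError".

    Patched code:
    {code}
""")

def detect_runtime_error(patched_code: str) -> str:
    """Execution-free, two-stage check: static analysis first, then an
    LLM fallback for errors that static tools cannot see."""
    # Stage 1: ast.parse flags SyntaxError without running the code;
    # a fuller pipeline would add static analyzers such as pyflakes here.
    try:
        ast.parse(patched_code)
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg}"

    # Stage 2: dynamic errors (AttributeError, TypeError, ...) depend on
    # runtime types, so defer to the LLM's reasoning over the code.
    return query_llm(DETECTION_PROMPT.format(code=patched_code))
```

Note that a syntactically valid patch that calls a nonexistent method sails through stage 1; only the stage-2 LLM, or actual execution, which REDO deliberately avoids, can flag the resulting AttributeError.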
Oct-10-2024