RedCode: Risky Code Execution and Generation Benchmark for Code Agents
Neural Information Processing Systems
With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding and software development, safety and security concerns, such as generating or executing malicious code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations of the safety of code agents, we propose RedCode, an evaluation platform with benchmarks grounded in four key principles: real interaction with systems, holistic evaluation of unsafe code generation and execution, diverse input formats, and high-quality safety scenarios and tests. RedCode consists of two parts to evaluate agents' safety in unsafe code execution and generation: (1) RedCode-Exec provides challenging code prompts in Python as inputs, aiming to evaluate code agents' ability to recognize and handle unsafe code. We then map the Python code to other programming languages (e.g., Bash) and natural text summaries or descriptions for evaluation, leading to a total of over 4,000 testing instances. We provide 25 types of critical vulnerabilities spanning various domains, such as websites, file systems, and operating systems. We provide a Docker sandbox environment to evaluate the execution capabilities of code agents and design corresponding evaluation metrics to assess their execution results.
May-25-2025, 16:32:47 GMT
- Country:
- North America > United States > Illinois (0.28)
- Genre:
- Research Report > New Finding (0.45)
- Workflow (0.67)
- Industry:
- Information Technology > Security & Privacy (1.00)
- Technology: