SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI

Yang, Yu, Nie, Yuzhou, Wang, Zhun, Tang, Yuheng, Guo, Wenbo, Li, Bo, Song, Dawn

Oct-14-2024–arXiv.org Artificial Intelligence

Existing works have established multiple benchmarks to highlight the security risks associated with Code GenAI. These risks are primarily reflected in two areas: a model's potential to generate insecure code (insecure coding) and its utility in cyberattacks (cyberattack helpfulness). While these benchmarks have made significant strides, there remain opportunities for further improvement. For instance, many current benchmarks tend to focus more on a model's ability to provide attack suggestions rather than its capacity to generate executable attacks. Additionally, most benchmarks rely heavily on static evaluation metrics (e.g., LLM judgment), which may not be as precise as dynamic metrics such as passing test cases. Furthermore, some large-scale benchmarks, while efficiently generated through automated methods, could benefit from more expert verification to ensure data quality and relevance to security scenarios. Conversely, expert-verified benchmarks, while offering high-quality data, often operate at a smaller scale. For insecure code, we introduce a new methodology for data creation that combines experts with automatic generation. Our methodology ensures the data quality while enabling large-scale generation. We also associate samples with test cases to conduct code-related dynamic evaluation. For cyberattack helpfulness, we set up a real environment and construct samples to prompt a model to generate actual attacks, along with dynamic metrics in our environment. Furthermore, it better identifies the security risks of SOTA models in insecure coding and cyberattack helpfulness. PLT to the SOTA code agent, Cursor, and, for the first time, identify non-trivial security risks in this advanced coding agent. Code GenAI, including specific code generation models and general large language models, have shown remarkable capabilities in code generation (Austin et al., 2021; Chen et al., 2021; DeepSeek, 2022; Dong et al., 2023; Hui et al., 2024), reasoning (Gu et al., 2024), and debugging (Tian et al., 2024). Together with these exciting new capabilities comes concern over these models' security risks. Recent research (Bhatt et al., 2023; Pearce et al., 2022) showed that code GenAI can produce insecure code, which significantly hinders the real-world deployment of AI-generated code. Moreover, these models can also be weaponized to facilitate cyberattacks. To understand the security risks of code GenAI, existing works developed several benchmarks to evaluate a code generation model's risk in producing insecure or vulnerable code (insecure coding) (Bhatt et al., 2023; 2024), as well as its potential to facilitate cyberattacks (cyberattack helpfulness) (Bhatt et al., 2024; Yuan et al., 2024). However, as demonstrated in Table 1, these benchmarks are foundationally limited.

benchmark, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

Oct-14-2024

arXiv.org PDF

Add feedback

Country:
- North America > United States > California (0.46)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Government > Military
  - Cyberwarfare (1.00)
- Information Technology > Security & Privacy (1.00)

Technology:
- Information Technology
  - Artificial Intelligence
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)
    - Natural Language
      - Chatbot (1.00)
      - Large Language Model (1.00)
    - Representation & Reasoning (1.00)
  - Security & Privacy (1.00)