SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI

Motivation

Existing works have established multiple benchmarks to highlight the security risks associated with Code GenAI. These risks are primarily reflected in two areas: a model's potential to generate insecure code (insecure coding) and its utility in cyberattacks (cyberattack helpfulness).

While these benchmarks have made significant strides, there remain opportunities for further improvement.

Many current benchmarks tend to focus more on a model's ability to provide attack suggestions rather than its capacity to generate executable attacks.
Most benchmarks rely heavily on static evaluation metrics (e.g., LLM judgment), which may not be as precise as dynamic metrics such as passing test cases.
Some large-scale benchmarks, while efficiently generated through automated methods, could benefit from more expert verification to ensure data quality and relevance to security scenarios.
Expert-verified benchmarks, while offering high-quality data, often operate at a smaller scale.

To address these gaps, we develop SecCodePLT, a unified and comprehensive evaluation platform for code GenAIs' risks. To the best of our knowledge, this is the first platform to enable precise security risks evaluation and end-to-end cyberattack helpfulness assessment of code GenAI.

Additionally, we are the first to reveal the security risks in Cursor, a popular AI code editor.

Task 1: Insecure Coding

We introduce a two-stage data creation pipeline, which enables scalability and ensures data quality.

Our method starts with generating a few seed samples for each selected type of vulnerability, i.e., one MITRE's Common Weakness Enumeration (CWE) (MITRE, 2024b), and then employs LLM-based mutators to generate more data from these seeds.
We include a validation step to filter out incorrect data, balancing correctness and scalability. More specifically, our seed generation begins by analyzing the vulnerability and context of a given CWE to manually cre- ate several security-related coding tasks.
For each task, we generate both vulnerable and patched code versions, along with functionality and security test cases. In cases that cannot be evaluated with standard test cases, we establish detailed rules for vulnerability detection. Each task, including its description, code and tests, forms a seed.

Given that all seeds are created and verified by human experts, they are guaranteed to be security-related. As detailed in Section 3, our automatic and validate process will also ensure the security relevance and correctness of newly generated data. Additionally, our samples contain both text descriptions and example code, enabling both instruc- tion generation and code completion tasks.

After generating the benchmark, we further design hybrid evaluation metrics that combine dynamic testing with rule-based detection for identifying insecure code. As discussed in Chhabra & Gupta (2010), hybrid metrics are more precise than pure static metrics.

Figure 1: Insecure Coding Data Pipeline

Task 2: Cyberattack Helpfulness

We then construct a cyberattack helpfulness benchmark to evaluate a model's capability in facilitating end-to-end cyberattacks.

Following MITRE ATT&CK (MITRE, 2024a), we break down a typical cyberattack into multiple steps, treating each as a category to guide the model's ability to perform specific aspects of an attack, such as writing exploits and deploying attacks.
We then design tailored prompts for each category to guide the model to generate executable attacks.
Finally, we create an environment with metrics to dynamically evaluate a model's outputted attack for each category.

Figure 2: Cyberattack Helpfulness Evaluation Framework

Key Findings

1. SecCodePLT achieves nearly 100% in both security relevance and instruction faithfulness, demonstrating its high quality. In contrast, CyberSecEval achieves only 68% and 42% on security relevance and instruction faithfulness, with 3 CWEs receiving scores lower than 30%.

Figure 3: Security Relevance Comparison

Figure 4: Instruction Faithfulness Comparison

2. When testing SecCodePLT against SOTA models on instruction generation and code completion tasks, GPT-4o is the most secure model, achieving a 55% secure coding rate. A larger model tends to be more secure. However, there remains significant room for further improvement.

3. Providing security policy reminders to highlight the potential vulnerabilities improves the secure coding rate by approximately 20%.

Figure 5: Instruction Generation Results

Figure 6: Code Completion Results

4. GPT-4o can launch full end-to-end cyberattacks but with a low success rate, while Claude is much safer in assisting attackers implement attacks with over a 90% refusal rate on sensitive attack steps.

Figure 7: Comparison of AI Models' Helpfulness in Cyberattacks

5. Cursor achieves an overall around 60% secure coding rate but fails entirely on some critical CWEs. Besides its different functionalities have different levels of risks.

BibTeX


@misc{yang2024seccodeplt,
  title={SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI}, 
  author={Yu Yang and Yuzhou Nie and Zhun Wang and Yuheng Tang and Wenbo Guo and Bo Li and Dawn Song},
  year={2024},
  eprint={2410.11096},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2410.11096}, 
}