You still have to study -- On the Security of LLM generated code

Read original: arXiv:2408.07106 - Published 8/15/2024 by Stefan Goetz, Andreas Schaad

Overview

AI assistants are increasingly used for programming tasks, even in classroom settings.
The code generated based on a programmer's prompt does not always meet accepted security standards.
This could be due to a lack of best-practice examples in the training data or the quality of the programmer's prompt.
This paper analyzes the security of code generated by 4 major Large Language Models (LLMs) using a case study approach for Python and JavaScript, guided by the MITRE CWE catalogue.

Plain English Explanation

The paper explores the use of AI-powered programming assistants, which are becoming more common even in educational settings. However, the code generated by these assistants based on a programmer's input does not always meet accepted security standards. This could be because the training data used to teach the AI models lacks examples of secure coding practices, or because the quality of the programmer's initial prompt influences the security of the generated code.

To investigate this, the researchers used a case study approach to analyze the security of code generated by 4 major LLMs, which are a type of AI model that can understand and generate human-like text. They focused on code written in Python and JavaScript and used the MITRE CWE (Common Weakness Enumeration) catalogue as a guide to evaluate the security of the generated code.

The results show that, depending on the prompting technique, some LLMs initially generate code of which 65% is deemed insecure by a trained security engineer. With increasing manual guidance from a skilled engineer, however, almost all of the analyzed LLMs eventually generate code that is close to 100% secure.

Technical Explanation

The paper examines the security of code generated by 4 major LLMs, using a case study approach for the Python and JavaScript programming languages. The researchers used the MITRE CWE catalogue as a guiding reference for defining security standards.
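To make "insecure according to the CWE catalogue" concrete, here is a minimal Python sketch of one well-known catalogue entry, CWE-89 (SQL injection). It is an illustration only: this summary does not state which CWE entries or programming tasks the study used, and the function names, table layout, and payload below are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_insecure(conn, username):
    # CWE-89: user input is concatenated straight into the SQL string,
    # so a crafted value can change the meaning of the query.
    query = "SELECT id, name FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchall()


def find_user_secure(conn, username):
    # Parameterized query: the driver passes the value separately from the
    # SQL text, so it is always treated as data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "nobody' OR '1'='1"  # illustrative injection string
    print(len(find_user_insecure(conn, payload)))  # 2: every row leaks
    print(find_user_secure(conn, payload))         # []: no such user
```

A reviewer working from the CWE catalogue would flag the first function and accept the second. Both return the same rows for ordinary input, which is why this kind of weakness is easy to miss when code is only tested for functionality.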

The study found that, depending on the prompting technique, some LLMs initially generated code of which 65% was deemed insecure by a trained security engineer. With increasing manual guidance from a skilled engineer, however, almost all of the analyzed LLMs eventually generated code that was close to 100% secure.
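As a rough sketch of how percentages like these can be tallied, the following Python snippet computes the share of samples a reviewer marked insecure at each prompting stage. The stage names and verdicts are made-up placeholders, not data from the paper, and the paper's actual scoring procedure may differ.

```python
from collections import defaultdict


def insecure_share(reviews):
    """Map each prompting stage to the percentage of samples marked insecure.

    `reviews` is an iterable of (stage, is_insecure) pairs, one per
    generated code sample that a human reviewer has assessed.
    """
    totals = defaultdict(int)
    insecure = defaultdict(int)
    for stage, is_insecure in reviews:
        totals[stage] += 1
        if is_insecure:
            insecure[stage] += 1
    return {stage: 100.0 * insecure[stage] / totals[stage] for stage in totals}


# Placeholder verdicts for illustration only; NOT results from the study.
sample_reviews = [
    ("initial prompt", True),
    ("initial prompt", True),
    ("initial prompt", False),
    ("initial prompt", False),
    ("guided follow-up", True),
    ("guided follow-up", False),
    ("guided follow-up", False),
    ("guided follow-up", False),
]

if __name__ == "__main__":
    print(insecure_share(sample_reviews))
    # {'initial prompt': 50.0, 'guided follow-up': 25.0}
```

Grouping by stage keeps the initial output and the guided output apart, which is the comparison the reported figures rest on.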

These results indicate that the quality of the programmer's prompt affects whether the generated code contains weaknesses. The researchers additionally hypothesize that a lack of best-practice examples in the training data contributes to the initially insecure output; this is offered as a possible cause, not as a measured result.

Critical Analysis

The paper provides valuable insights into the security implications of using AI-powered programming assistants, particularly in educational settings. By highlighting the potential for generated code to contain security vulnerabilities, the research underscores the importance of carefully monitoring and guiding the use of these technologies.

One limitation of the study is the relatively small sample size of 4 LLMs. It would be interesting to see a more comprehensive analysis that includes a wider range of AI models and programming languages. Additionally, the paper does not delve deeply into the specific prompting techniques that led to more secure code generation, which could be a fruitful area for further investigation.

It is also worth considering the broader implications of AI-generated code, such as the potential for unintended consequences or the ethical considerations around the use of these technologies in sensitive domains. As the adoption of AI-powered programming assistants continues to grow, it will be crucial to address these concerns and ensure that the generated code meets appropriate security and quality standards.

Conclusion

This paper sheds light on the security implications of using AI-powered programming assistants, even in educational settings. The researchers found that the security of the generated code can vary significantly depending on the prompting techniques used, and they suggest that a lack of best-practice examples in the training data may be a further cause.

While the results suggest that skilled human guidance can help improve the security of the generated code, the findings highlight the need for a more comprehensive understanding of the factors that influence the quality and security of AI-generated code. As the use of these technologies continues to expand, it will be important for researchers, educators, and practitioners to work together to address these challenges and ensure the responsible development and deployment of AI-powered programming tools.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

You still have to study -- On the Security of LLM generated code

Stefan Goetz, Andreas Schaad

We witness an increasing usage of AI-assistants even for routine (classroom) programming tasks. However, the code generated on basis of a so called prompt by the programmer does not always meet accepted security standards. On the one hand, this may be due to lack of best-practice examples in the training data. On the other hand, the actual quality of the programmers prompt appears to influence whether generated code contains weaknesses or not. In this paper we analyse 4 major LLMs with respect to the security of generated code. We do this on basis of a case study for the Python and Javascript language, using the MITRE CWE catalogue as the guiding security definition. Our results show that using different prompting techniques, some LLMs initially generate 65% code which is deemed insecure by a trained security engineer. On the other hand almost all analysed LLMs will eventually generate code being close to 100% secure with increasing manual guidance of a skilled engineer.

8/15/2024

Generate and Pray: Using SALLMS to Evaluate the Security of LLM Generated Code

Mohammed Latif Siddiq, Joanna C. S. Santos, Sajith Devareddy, Anna Muller

With the growing popularity of Large Language Models (LLMs) in software engineers' daily practices, it is important to ensure that the code generated by these tools is not only functionally correct but also free of vulnerabilities. Although LLMs can help developers to be more productive, prior empirical studies have shown that LLMs can generate insecure code. There are two contributing factors to the insecure code generation. First, existing datasets used to evaluate LLMs do not adequately represent genuine software engineering tasks sensitive to security. Instead, they are often based on competitive programming challenges or classroom-type coding tasks. In real-world applications, the code produced is integrated into larger codebases, introducing potential security risks. Second, existing evaluation metrics primarily focus on the functional correctness of the generated code while ignoring security considerations. Therefore, in this paper, we described SALLM, a framework to benchmark LLMs' abilities to generate secure code systematically. This framework has three major components: a novel dataset of security-centric Python prompts, configurable assessment techniques to evaluate the generated code, and novel metrics to evaluate the models' performance from the perspective of secure code generation.

9/6/2024

Prompting Techniques for Secure Code Generation: A Systematic Investigation

Catherine Tony, Nicolás E. Díaz Ferreyra, Markus Mutas, Salem Dhiff, Riccardo Scandariato

Large Language Models (LLMs) are gaining momentum in software development with prompt-driven programming enabling developers to create code from natural language (NL) instructions. However, studies have questioned their ability to produce secure code and, thereby, the quality of prompt-generated software. Alongside, various prompting techniques that carefully tailor prompts have emerged to elicit optimal responses from LLMs. Still, the interplay between such prompting strategies and secure code generation remains under-explored and calls for further investigations. OBJECTIVE: In this study, we investigate the impact of different prompting techniques on the security of code generated from NL instructions by LLMs. METHOD: First we perform a systematic literature review to identify the existing prompting techniques that can be used for code generation tasks. A subset of these techniques are evaluated on GPT-3, GPT-3.5, and GPT-4 models for secure code generation. For this, we used an existing dataset consisting of 150 NL security-relevant code-generation prompts. RESULTS: Our work (i) classifies potential prompting techniques for code generation (ii) adapts and evaluates a subset of the identified techniques for secure code generation tasks and (iii) observes a reduction in security weaknesses across the tested LLMs, especially after using an existing technique called Recursive Criticism and Improvement (RCI), contributing valuable insights to the ongoing discourse on LLM-generated code security.

7/10/2024

Security Code Review by Large Language Models

Jiaxin Yu, Peng Liang, Yujia Fu, Amjed Tahir, Mojtaba Shahin, Chong Wang, Yangxiao Cai

Security code review, as a time-consuming and labour-intensive process, typically requires integration with automated security defect detection tools to ensure code security. Despite the emergence of numerous security analysis tools, those tools face challenges in terms of their poor generalization, high false positive rates, and coarse detection granularity. A recent development with Large Language Models (LLMs) has made them a promising candidate to support security code review. To this end, we conducted the first empirical study to understand the capabilities of LLMs in security code review, delving into the performance, quality problems, and influential factors of LLMs to detect security defects in code reviews. Specifically, we compared the performance of 6 LLMs under five different prompts with the state-of-the-art static analysis tools to detect and analyze security defects. For the best-performing LLM, we conducted a linguistic analysis to explore quality problems in its responses, as well as a regression analysis to investigate the factors influencing its performance. The results are that: (1) existing pre-trained LLMs have limited capability in detecting security defects during code review but significantly outperform the state-of-the-art static analysis tools. (2) GPT-4 performs best among all LLMs when provided with a CWE list for reference. (3) GPT-4 makes few factual errors but frequently generates unnecessary content or responses that are not compliant with the task requirements given in the prompts. (4) GPT-4 is more adept at identifying security defects in code files with fewer tokens, containing functional logic and written by developers with less involvement in the project.

6/11/2024