A Case Study of Large Language Models (ChatGPT and CodeBERT) for Security-Oriented Code Analysis

Read original: arXiv:2307.12488 - Published 7/30/2024 by Zhilong Wang, Lan Zhang, Chen Cao, Nanqing Luo, Xinzhi Luo, Peng Liu

Overview

Evaluates the capabilities and limitations of large language models (LLMs) like ChatGPT and CodeBERT in security-oriented program analysis tasks
Focuses on how these models perform on tasks like code review, vulnerability analysis, and other security-related code analysis
Examines the models' strengths in learning high-level code semantics as well as their weaknesses, such as reliance on well-defined variable and function names

Plain English Explanation

Large language models (LLMs) like ChatGPT and CodeBERT have shown promise in code analysis tasks, but their specific strengths and limitations in the security domain are not yet fully understood. This paper delves into how these LLMs perform on typical security-focused code analysis tasks, such as reviewing code for vulnerabilities or identifying security issues.

The researchers found that the LLMs are effective at understanding the high-level meaning of code, which makes them potentially useful tools for security analysts. For example, ChatGPT was able to identify and explain security-related problems in code.

However, the models also have significant limitations. They rely heavily on the quality of variable and function names, and their performance drops when those names are replaced with meaningless or misleading ones. As a result, the models may struggle to analyze code that has been intentionally obscured, as is common in malicious software.

Overall, the paper suggests that LLMs can be valuable assets in security-oriented code analysis, but also highlights important areas for further research and development to address the current limitations.

Technical Explanation

The researchers evaluated the performance of two representative LLMs, ChatGPT and CodeBERT, on a variety of security-focused code analysis tasks. These tasks ranged in difficulty and were designed to assess the models' capabilities from the perspectives of both attackers and security analysts.

The results showed that the LLMs learned high-level semantic information from code well enough to identify and explain security-related issues. For example, ChatGPT detected common vulnerabilities such as SQL injection and buffer overflow and provided detailed explanations of the problems.
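To make the SQL injection case concrete, here is a minimal sketch of the kind of snippet a reviewer (human or LLM) might be asked to assess. It is not taken from the paper's test cases; the table layout, the function names `find_user_unsafe` and `find_user_safe`, and the helper `make_demo_db` are assumptions made up for illustration.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, username):
    # Vulnerable: user input is spliced directly into the SQL text,
    # so a crafted value can change the structure of the query.
    query = f"SELECT id, name FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, username):
    # Safer: the value is passed as a bound parameter and is never
    # interpreted as SQL syntax.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "' OR '1'='1"
    print(find_user_unsafe(conn, payload))  # matches every row in the table
    print(find_user_safe(conn, payload))    # matches no row
```

Note that the descriptive names here (`username`, `query`) give strong hints about what the code does, which is exactly the kind of signal the paper reports the models depend on.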

However, the models' performance depended heavily on the quality of the variable, method, and function names in the code. When those names were replaced with nonsense or misleading ones, the models could no longer extract the information they needed and their effectiveness dropped. This is a significant limitation, because code obfuscation is a common technique used by malicious actors to evade detection.

The researchers also found that the LLMs performed better on tasks that required a high-level understanding of code semantics than on tasks that required low-level, detailed analysis. This suggests that the models may be better suited to tasks like code review and vulnerability identification than to analysis that depends on fine-grained program details.
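The name-replacement idea can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions, not the authors' dataset-construction tooling: the class `NameAnonymizer`, the function `anonymize`, and the `v0, v1, ...` naming scheme are hypothetical. It also makes no attempt to handle imports, attribute accesses, or keyword arguments at call sites, so it is only safe on small self-contained snippets. It requires Python 3.9 or later for `ast.unparse`.

```python
import ast
import builtins


class NameAnonymizer(ast.NodeTransformer):
    """Replace user-defined identifiers with meaningless placeholders."""

    def __init__(self):
        self.mapping = {}

    def _alias(self, name):
        # Leave builtins such as len or print untouched so the code still runs.
        if hasattr(builtins, name):
            return name
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node


def anonymize(source: str) -> str:
    """Return source with functions, parameters, and variables renamed."""
    tree = NameAnonymizer().visit(ast.parse(source))
    return ast.unparse(tree)


if __name__ == "__main__":
    original = (
        "def check_password(user_input, stored_hash):\n"
        "    attempts = len(user_input)\n"
        "    return attempts > 0 and user_input == stored_hash\n"
    )
    print(anonymize(original))
```

The transformed function behaves identically, but every hint about its purpose (`check_password`, `stored_hash`) is gone. Comparing a model's output on the original and the renamed version is one simple way to probe how much it leans on names rather than program logic.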

Critical Analysis

The paper highlights important strengths and limitations of using LLMs for security-oriented code analysis, and raises several critical questions for further research.

One key limitation is the models' heavy reliance on well-defined variable and function names. This makes them vulnerable to techniques like code obfuscation, which are commonly used by malicious actors to evade detection. Addressing this limitation would be crucial for the effective deployment of LLMs in security-critical applications.

Additionally, the paper notes that the LLMs handled tasks requiring a high-level understanding of code semantics better than tasks requiring low-level, detailed analysis. This raises questions about their suitability for security work that depends on precise, fine-grained reasoning about program behavior.

The researchers also acknowledge the need for further investigation into the safety and generalization challenges of using LLMs in security-oriented contexts. As these models become more widely adopted, it will be crucial to carefully evaluate their robustness and reliability in the face of adversarial attacks or unexpected inputs.

Overall, the paper provides a valuable starting point for understanding the potential and limitations of LLMs in security-oriented code analysis. However, the concerns raised deserve in-depth investigation to ensure the responsible and effective deployment of these models in security-critical applications.

Conclusion

This study offers insight into the capabilities and limitations of large language models like ChatGPT and CodeBERT for security-oriented code analysis. While the LLMs learn high-level code semantics well, their heavy reliance on well-defined variable and function names is a significant limitation, particularly given the obfuscation techniques commonly used by malicious actors.

The findings suggest that these LLMs may be valuable assets in certain security-focused tasks, such as code review and vulnerability identification. However, their suitability for more complex security analysis remains an open question that requires further investigation. As the use of LLMs in security-critical applications continues to grow, it will be crucial to address the concerns raised in this study to ensure the responsible and effective deployment of these powerful models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
