Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models

Read original: arXiv:2408.00197 - Published 8/2/2024 by Elijah Pelofske, Vincent Urias, Lorie M. Liebrock

Overview

This paper evaluates how well open-source Generative Pre-trained Transformer (GPT) models can identify vulnerable code in C and C++ source files.
The researchers query 5 GPT models on 36 source code examples from the NIST SARD dataset, using 10 inference temperatures and 100 repetitions at each setting.
They conclude that the models are not suitable for fully automated vulnerability scanning because the false positive and false negative rates are too high, although the models surpass random sampling on some test cases.

Plain English Explanation

The paper tests whether existing artificial intelligence (AI) language models can automatically find security vulnerabilities in source code. The researchers took open-source large language models and asked them to point out the vulnerable lines in C and C++ programs and to name the type of weakness, such as a buffer overflow. The test programs come from a NIST collection in which the vulnerable lines are already labeled, and they were selected so that no plain-English text in the code gives away the answer.

Because the correct answer is known for every test program, the researchers can measure exactly how often each model is right or wrong. Each model was queried many times at different temperature settings, which control how random its output is. The models sometimes identified the exact vulnerable line and the correct weakness type, but the success rate was low, and they raised too many false alarms and missed too many real flaws to be used for fully automated scanning.

The motivation is practical: reliable automated vulnerability detection would help software developers and security teams find and fix security issues more efficiently. This matters more as software becomes increasingly complex and the threat landscape continues to evolve.

Technical Explanation

The paper evaluates open-source Generative Pre-trained Transformer (GPT) models on the task of automatically identifying vulnerable code syntax in C and C++ source code. A total of 5 GPT models are evaluated, including Llama-2-70b-chat-hf, using 10 different inference temperatures and 100 repetitions at each setting, which gives 5,000 GPT queries per source code example analyzed.
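
The query count given in the paper's abstract (5 models, 10 inference temperatures, 100 repetitions, 5,000 queries per test case) can be reproduced with a small sketch. This is only an illustration of the arithmetic, not the authors' code. The function `build_query_grid` is hypothetical, every model name except Llama-2-70b-chat-hf is a placeholder, and the temperature values are assumed (the abstract confirms only that 0.1 was one of the 10 settings).

```python
from itertools import product

# Only the first model name appears in the abstract; the others are placeholders.
MODELS = ["Llama-2-70b-chat-hf", "model-b", "model-c", "model-d", "model-e"]

# Assumed values: the abstract states there were 10 temperatures and that
# 0.1 was one of them, but it does not list the other nine.
TEMPERATURES = [round(0.1 * i, 1) for i in range(1, 11)]

REPETITIONS = 100  # repetitions per (model, temperature) setting


def build_query_grid(test_case_id):
    """Return one (test case, model, temperature, repetition) tuple per query."""
    return [
        (test_case_id, model, temperature, repetition)
        for model, temperature, repetition in product(
            MODELS, TEMPERATURES, range(REPETITIONS)
        )
    ]


grid = build_query_grid("149165")
print(len(grid))  # 5 models * 10 temperatures * 100 repetitions = 5000
```

Repeating each setting many times matters because sampling at a nonzero temperature is stochastic, so a single query says little about how often a model gets a test case right.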

The test set is a selection of 36 source code examples from the NIST SARD dataset, curated so that they contain no natural English that indicates the presence or absence of a particular vulnerability. Each SARD example identifies its vulnerable lines and the associated Common Weakness Enumeration (CWE), so the model output can be scored exactly, both for the lines it flags and for the CWE number it reports.

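The results show that the evaluated models are not suitable for fully automated vulnerability scanning: the false positive and false negative rates are too high to be useful in practice. The models do surpass random sampling on some test cases and can identify the exact vulnerable lines, albeit at a low success rate. The best result was Llama-2-70b-chat-hf at an inference temperature of 0.1 on NIST SARD test case 149165, a buffer overflow example, with a binary classification recall of 1.0 and a precision of 1.0 for correctly and uniquely identifying the vulnerable line of code and the correct CWE number.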
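
The paper's abstract reports binary classification precision and recall for identifying vulnerable lines. The sketch below shows one plausible way to compute these scores for a single model response. It assumes the response has already been parsed into a set of flagged line numbers. The function name `line_precision_recall`, the example line numbers, and the convention of returning 0.0 when nothing is flagged are assumptions made for illustration, not details taken from the paper.

```python
def line_precision_recall(predicted_lines, vulnerable_lines):
    """Score flagged line numbers against the labeled vulnerable lines.

    precision = correctly flagged lines / all flagged lines
    recall    = correctly flagged lines / all labeled vulnerable lines
    """
    predicted = set(predicted_lines)
    actual = set(vulnerable_lines)
    true_positives = len(predicted & actual)
    # Assumed convention: an empty set scores 0.0 instead of dividing by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical line numbers, chosen only to illustrate the three outcomes.
exact_hit = line_precision_recall([12], [12])            # only the right line
over_flagged = line_precision_recall([12, 15, 20], [12])  # right line plus two false positives
miss = line_precision_recall([7], [12])                  # wrong line only

print(exact_hit, over_flagged, miss)
```

A response scores 1.0 on both measures only when it flags every labeled line and nothing else, which corresponds to the "correctly and uniquely" wording in the abstract. Flagging extra lines keeps recall high but lowers precision.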

Critical Analysis

The paper reports a largely negative result for using current open-source language models as automated vulnerability scanners. Several limitations and areas for future research are worth noting:

Dataset Coverage: The evaluation uses 36 examples from the NIST SARD dataset. A curated benchmark of this size may not represent the full spectrum of real-world software vulnerabilities, so the measured error rates may not carry over to production code.
Interpretability: The models' inner workings and decision-making process remain opaque, which makes it hard to tell why a given line was flagged. Improving the interpretability of the models could help developers better understand and trust the vulnerability detection process.
Real-world Deployment: The paper evaluates the models on benchmark test cases, and further research is needed to assess performance and practicality in real-world software development and security testing scenarios.
Vulnerability Remediation: The evaluation covers only detecting and classifying vulnerabilities. Automated recommendations or guidance for remediating the identified vulnerabilities would be a natural extension.

Despite these limitations, the findings give a quantified picture of what current open-source language models can and cannot do in software vulnerability analysis: occasional exact detections, but error rates too high for fully automated use.

Conclusion

This paper evaluates automated software vulnerability static code analysis using open-source Generative Pre-trained Transformer (GPT) models. Across 5 models and 36 NIST SARD test cases in C and C++, the models were sometimes able to identify the exact vulnerable line and the correct CWE number, but only at a low success rate.

The authors conclude that the evaluated models are not suitable for fully automated vulnerability scanning because the false positive and false negative rates are too high to be useful in practice. Even so, the models surpassed random sampling on some test cases, which suggests that language models may come to play a role in automated vulnerability detection as they improve.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models

Elijah Pelofske, Vincent Urias, Lorie M. Liebrock

Generative Pre-Trained Transformer models have been shown to be surprisingly effective at a variety of natural language processing tasks -- including generating computer code. We evaluate the effectiveness of open source GPT models for the task of automatic identification of the presence of vulnerable code syntax (specifically targeting C and C++ source code). This task is evaluated on a selection of 36 source code examples from the NIST SARD dataset, which are specifically curated to not contain natural English that indicates the presence, or lack thereof, of a particular vulnerability. The NIST SARD source code dataset contains identified vulnerable lines of source code that are examples of one out of the 839 distinct Common Weakness Enumerations (CWE), allowing for exact quantification of the GPT output classification error rate. A total of 5 GPT models are evaluated, using 10 different inference temperatures and 100 repetitions at each setting, resulting in 5,000 GPT queries per vulnerable source code analyzed. Ultimately, we find that the GPT models that we evaluated are not suitable for fully automated vulnerability scanning because the false positive and false negative rates are too high to likely be useful in practice. However, we do find that the GPT models perform surprisingly well at automated vulnerability detection for some of the test cases, in particular surpassing random sampling, and being able to identify the exact lines of code that are vulnerable albeit at a low success rate. The best performing GPT model result found was Llama-2-70b-chat-hf with inference temperature of 0.1 applied to NIST SARD test case 149165 (which is an example of a buffer overflow vulnerability), which had a binary classification recall score of 1.0 and a precision of 1.0 for correctly and uniquely identifying the vulnerable line of code and the correct CWE number.

8/2/2024

Automated Creation of Source Code Variants of a Cryptographic Hash Function Implementation Using Generative Pre-Trained Transformer Models

Elijah Pelofske, Vincent Urias, Lorie M. Liebrock

Generative pre-trained transformers (GPTs) are a type of large language machine learning model that are unusually adept at producing novel, and coherent, natural language. In this study the ability of GPT models to generate novel and correct versions, and notably very insecure versions, of implementations of the cryptographic hash function SHA-1 is examined. The GPT models Llama-2-70b-chat-h, Mistral-7B-Instruct-v0.1, and zephyr-7b-alpha are used. The GPT models are prompted to re-write each function using a modified version of the localGPT framework and langchain to provide word embedding context of the full source code and header files to the model, resulting in over 150,000 function re-write GPT output text blocks, approximately 50,000 of which were able to be parsed as C code and subsequently compiled. The generated code is analyzed for being compilable, correctness of the algorithm, memory leaks, compiler optimization stability, and character distance to the reference implementation. Remarkably, several generated function variants have a high implementation security risk of being correct for some test vectors, but incorrect for other test vectors. Additionally, many function implementations were not correct to the reference algorithm of SHA-1, but produced hashes that have some of the basic characteristics of hash functions. Many of the function re-writes contained serious flaws such as memory leaks, integer overflows, out of bounds accesses, use of uninitialised values, and compiler optimization instability. Compiler optimization settings and SHA-256 hash checksums of the compiled binaries are used to cluster implementations that are equivalent but may not have identical syntax - using this clustering over 100,000 novel and correct versions of the SHA-1 codebase were generated where each component C function of the reference implementation is different from the original code.

7/11/2024

GPTScan: Detecting Logic Vulnerabilities in Smart Contracts by Combining GPT with Program Analysis

Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Haijun Wang, Zhengzi Xu, Xiaofei Xie, Yang Liu

Smart contracts are prone to various vulnerabilities, leading to substantial financial losses over time. Current analysis tools mainly target vulnerabilities with fixed control or data-flow patterns, such as re-entrancy and integer overflow. However, a recent study on Web3 security bugs revealed that about 80% of these bugs cannot be audited by existing tools due to the lack of domain-specific property description and checking. Given recent advances in Large Language Models (LLMs), it is worth exploring how Generative Pre-training Transformer (GPT) could aid in detecting logic vulnerabilities. In this paper, we propose GPTScan, the first tool combining GPT with static analysis for smart contract logic vulnerability detection. Instead of relying solely on GPT to identify vulnerabilities, which can lead to high false positives and is limited by GPT's pre-trained knowledge, we utilize GPT as a versatile code understanding tool. By breaking down each logic vulnerability type into scenarios and properties, GPTScan matches candidate vulnerabilities with GPT. To enhance accuracy, GPTScan further instructs GPT to intelligently recognize key variables and statements, which are then validated by static confirmation. Evaluation on diverse datasets with around 400 contract projects and 3K Solidity files shows that GPTScan achieves high precision (over 90%) for token contracts and acceptable precision (57.14%) for large projects like Web3Bugs. It effectively detects ground-truth logic vulnerabilities with a recall of over 70%, including 9 new vulnerabilities missed by human auditors. GPTScan is fast and cost-effective, taking an average of 14.39 seconds and 0.01 USD to scan per thousand lines of Solidity code. Moreover, static confirmation helps GPTScan reduce two-thirds of false positives.

5/7/2024

Automated Multi-Language to English Machine Translation Using Generative Pre-Trained Transformers

Elijah Pelofske, Vincent Urias, Lorie M. Liebrock

The task of accurate and efficient language translation is an extremely important information processing task. Machine learning enabled and automated translation that is accurate and fast is often a large topic of interest in the machine learning and data science communities. In this study, we examine using local Generative Pretrained Transformer (GPT) models to perform automated zero shot black-box, sentence wise, multi-natural-language translation into English text. We benchmark 16 different open-source GPT models, with no custom fine-tuning, from the Huggingface LLM repository for translating 50 different non-English languages into English using translated TED Talk transcripts as the reference dataset. These GPT model inference calls are performed strictly locally, on single A100 Nvidia GPUs. Benchmark metrics that are reported are language translation accuracy, using BLEU, GLEU, METEOR, and chrF text overlap measures, and wall-clock time for each sentence translation. The best overall performing GPT model for translating into English text for the BLEU metric is ReMM-v2-L2-13B with a mean score across all tested languages of 0.152, for the GLEU metric is ReMM-v2-L2-13B with a mean score across all tested languages of 0.256, for the chrF metric is Llama2-chat-AYT-13B with a mean score across all tested languages of 0.448, and for the METEOR metric is ReMM-v2-L2-13B with a mean score across all tested languages of 0.438.

4/24/2024