LLAMAFUZZ: Large Language Model Enhanced Greybox Fuzzing

2406.07714

Published 6/17/2024 by Hongxiang Zhang, Yuyang Rong, Yifeng He, Hao Chen

LLAMAFUZZ: Large Language Model Enhanced Greybox Fuzzing

Abstract

Greybox fuzzing has achieved success in revealing bugs and vulnerabilities in programs. However, randomized mutation strategies have limited the fuzzer's performance on structured data. Specialized fuzzers can handle complex structured data, but require additional efforts in grammar and suffer from low throughput. In this paper, we explore the potential of utilizing the Large Language Model to enhance greybox fuzzing for structured data. We utilize the pre-trained knowledge of LLM about data conversion and format to generate new valid inputs. We further fine-tuned it with paired mutation seeds to learn structured format and mutation strategies effectively. Our LLM-based fuzzer, LLAMAFUZZ, integrates the power of LLM to understand and mutate structured data to fuzzing. We conduct experiments on the standard bug-based benchmark Magma and a wide variety of real-world programs. LLAMAFUZZ outperforms our top competitor by 41 bugs on average. We also identified 47 unique bugs across all trials. Moreover, LLAMAFUZZ demonstrated consistent performance on both bug trigger and bug reached. Compared to AFL++, LLAMAFUZZ achieved 27.19% more branches in real-world program sets on average. We also demonstrate a case study to explain how LLMs enhance the fuzzing process in terms of code coverage.

Create account to get full access

Overview

• LLAMAFUZZ: Large Language Model Enhanced Greybox Fuzzing explores a new approach to software testing and vulnerability discovery called "greybox fuzzing" that combines traditional fuzzing techniques with large language models.

• The researchers propose LLAMAFUZZ, a system that leverages the capabilities of large language models to generate diverse and effective input data for fuzzing, with the goal of finding more bugs and vulnerabilities in software.

• LLAMAFUZZ addresses some of the key challenges in applying large language models to software vulnerability detection, such as generating inputs that are both semantically valid and capable of triggering edge cases in the software.

Plain English Explanation

Fuzzing is a software testing technique where random or semi-random inputs are fed into a program to find bugs or vulnerabilities. LLAMAFUZZ builds on this approach by using large language models - powerful AI systems trained on massive amounts of text - to generate the input data more intelligently.

The key idea is that large language models can be used to generate diverse, semantically valid inputs that are more likely to uncover issues in the software than purely random inputs. This is because the language model has learned the structure and patterns of valid input data, and can use this knowledge to generate more targeted and effective test cases.

For example, if the software being tested accepts JSON data as input, a large language model could be used to generate well-formed JSON documents that exercise different parts of the code, rather than just throwing random bytes at the program and hoping for the best.

The researchers show that LLAMAFUZZ is able to find more bugs and vulnerabilities than traditional fuzzing approaches, particularly in software that processes structured data formats. This is an important advancement, as many real-world applications rely on processing complex data formats, and traditional fuzzing can struggle to generate valid inputs for these cases.

Technical Explanation

The core of the LLAMAFUZZ system is a large language model that has been fine-tuned on a corpus of valid input data for the software being tested. This fine-tuned model is then used to generate new input data during the fuzzing process.

The researchers experiment with different approaches for incorporating the language model into the fuzzing loop, such as using the model to generate entire inputs from scratch, or using it to mutate existing inputs in targeted ways. They also explore techniques for ensuring the generated inputs are both semantically valid and capable of triggering edge cases in the software.

Their experiments on a range of benchmark programs show that LLAMAFUZZ is able to find significantly more bugs and vulnerabilities than traditional greybox fuzzing approaches, especially in software that processes structured data formats. The language model-based inputs were not only more effective at finding issues, but also required fewer test cases to do so.

Critical Analysis

The paper presents a compelling approach to enhancing traditional fuzzing techniques with the power of large language models. However, the researchers note that there are still some challenges to overcome, such as:

Ensuring input validity: While the language model helps generate more semantically valid inputs, there may still be edge cases where the generated inputs are not fully compliant with the expected data format. Further work is needed to ensure 100% input validity.
Handling diverse software domains: The experiments in the paper focused on a relatively narrow set of benchmark programs. Applying LLAMAFUZZ to a broader range of software, including highly domain-specific applications, may require additional techniques or fine-tuning of the language model.
Computational cost: Using a large language model for fuzzing may increase the computational resources required compared to traditional approaches. The researchers should explore ways to optimize the system's efficiency.

Overall, the LLAMAFUZZ approach represents an exciting step forward in combining the strengths of large language models and traditional fuzzing techniques. With further refinement and validation on a wider range of software, this technique could become a powerful tool for improving the security and reliability of complex software systems.

Conclusion

The LLAMAFUZZ paper presents a novel approach to software testing and vulnerability discovery that leverages the power of large language models. By using a fine-tuned language model to generate diverse, semantically valid input data, the researchers have shown that LLAMAFUZZ can find significantly more bugs and vulnerabilities than traditional fuzzing techniques, especially in software that processes structured data formats.

While there are still some challenges to overcome, this research represents an important advancement in the field of software security and reliability. By harnessing the capabilities of large language models, LLAMAFUZZ has the potential to play a key role in improving the robustness and safety of a wide range of software applications. As the capabilities of large language models continue to evolve, it will be exciting to see how techniques like LLAMAFUZZ can be further refined and applied to help ensure the security and reliability of the software that powers our increasingly digital world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

When Fuzzing Meets LLMs: Challenges and Opportunities

Yu Jiang, Jie Liang, Fuchen Ma, Yuanliang Chen, Chijin Zhou, Yuheng Shen, Zhiyong Wu, Jingzhou Fu, Mingzhe Wang, ShanShan Li, Quan Zhang

Fuzzing, a widely-used technique for bug detection, has seen advancements through Large Language Models (LLMs). Despite their potential, LLMs face specific challenges in fuzzing. In this paper, we identified five major challenges of LLM-assisted fuzzing. To support our findings, we revisited the most recent papers from top-tier conferences, confirming that these challenges are widespread. As a remedy, we propose some actionable recommendations to help improve applying LLM in Fuzzing and conduct preliminary evaluations on DBMS fuzzing. The results demonstrate that our recommendations effectively address the identified challenges.

4/26/2024

cs.SE cs.AI

MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering

Robert Osazuwa Ness, Katie Matton, Hayden Helm, Sheng Zhang, Junaid Bajwa, Carey E. Priebe, Eric Horvitz

Large language models (LLM) have achieved impressive performance on medical question-answering benchmarks. However, high benchmark accuracy does not imply that the performance generalizes to real-world clinical settings. Medical question-answering benchmarks rely on assumptions consistent with quantifying LLM performance but that may not hold in the open world of the clinic. Yet LLMs learn broad knowledge that can help the LLM generalize to practical conditions regardless of unrealistic assumptions in celebrated benchmarks. We seek to quantify how well LLM medical question-answering benchmark performance generalizes when benchmark assumptions are violated. Specifically, we present an adversarial method that we call MedFuzz (for medical fuzzing). MedFuzz attempts to modify benchmark questions in ways aimed at confounding the LLM. We demonstrate the approach by targeting strong assumptions about patient characteristics presented in the MedQA benchmark. Successful attacks modify a benchmark item in ways that would be unlikely to fool a medical expert but nonetheless trick the LLM into changing from a correct to an incorrect answer. Further, we present a permutation test technique that can ensure a successful attack is statistically significant. We show how to use performance on a MedFuzzed benchmark, as well as individual successful attacks. The methods show promise at providing insights into the ability of an LLM to operate robustly in more realistic settings.

6/12/2024

cs.CL cs.LG

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing

Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, LLaMa-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.

6/28/2024

cs.AI

💬

Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study

Karl Tamberg, Hayretdin Bahsi

Despite various approaches being employed to detect vulnerabilities, the number of reported vulnerabilities shows an upward trend over the years. This suggests the problems are not caught before the code is released, which could be caused by many factors, like lack of awareness, limited efficacy of the existing vulnerability detection tools or the tools not being user-friendly. To help combat some issues with traditional vulnerability detection tools, we propose using large language models (LLMs) to assist in finding vulnerabilities in source code. LLMs have shown a remarkable ability to understand and generate code, underlining their potential in code-related tasks. The aim is to test multiple state-of-the-art LLMs and identify the best prompting strategies, allowing extraction of the best value from the LLMs. We provide an overview of the strengths and weaknesses of the LLM-based approach and compare the results to those of traditional static analysis tools. We find that LLMs can pinpoint many more issues than traditional static analysis tools, outperforming traditional tools in terms of recall and F1 scores. The results should benefit software developers and security analysts responsible for ensuring that the code is free of vulnerabilities.

5/27/2024

cs.CR cs.AI cs.SE