PropertyGPT: LLM-driven Formal Verification of Smart Contracts through Retrieval-Augmented Property Generation

Read original: arXiv:2405.02580 - Published 5/7/2024 by Ye Liu, Yue Xue, Daoyuan Wu, Yuqiang Sun, Yi Li, Miaolei Shi, Yang Liu

🛸

Overview

This paper explores using large language models (LLMs) like GPT-4 to automatically generate customized properties for unknown code based on existing human-written properties.
The key challenges addressed are ensuring the generated properties are compilable, appropriate, and runtime-verifiable.
The proposed system, called PropertyGPT, successfully generated high-quality properties, detected known vulnerabilities, and uncovered new zero-day vulnerabilities.

Plain English Explanation

Large language models (LLMs) like GPT-4 have become incredibly powerful at understanding and generating human-like text. This paper investigates whether we can leverage these models to help with the process of writing properties for code verification.

Properties are essentially rules or assertions about how a piece of code should behave. When you're working with complex, unfamiliar code, it can be challenging to come up with the right properties to check that the code is working correctly. The researchers wanted to see if an LLM could take examples of existing properties and use that knowledge to automatically generate new, customized properties for other code.

The key challenge is ensuring these automatically generated properties are actually useful - they need to be written in a way that the computer can understand and verify, and they need to accurately describe the intended behavior of the code. The researchers developed strategies to address these issues, such as using compiler feedback to refine the properties and designing a specialized prover to formally verify their correctness.

Overall, the researchers' system called PropertyGPT was able to generate high-quality properties that were effective at detecting known vulnerabilities and even uncovering new ones that had not been discovered before. This shows the potential for using powerful language models to assist with the critical task of verifying the security and correctness of complex software.

Technical Explanation

The core idea of this paper is to leverage the language understanding capabilities of large language models (LLMs) like GPT-4 to automatically generate customized properties for unknown code. The researchers start by embedding existing human-written properties (e.g., from Certora auditing reports) into a vector database.

When given a new piece of code, the system retrieves a reference property from the database and uses it as a starting point for LLM-based in-context learning to generate a new, tailored property. However, ensuring the generated properties are (i) compilable, (ii) appropriate, and (iii) runtime-verifiable presents several challenges.

To address (i), the researchers use the compilation and static analysis feedback as an external "oracle" to guide the LLM in iteratively revising the generated properties until they are compilable. For (ii), they consider multiple dimensions of similarity to rank the properties and employ a weighted algorithm to identify the top-K properties as the final result. For (iii), they design a dedicated prover to formally verify the correctness of the generated properties.

The researchers implemented these strategies into a system called PropertyGPT, which they evaluated using 623 human-written properties collected from 23 Certora projects. Their experiments showed that PropertyGPT can generate comprehensive and high-quality properties, achieving an 80% recall compared to the ground truth. It also successfully detected 26 out of 37 known CVEs/attack incidents and uncovered 12 zero-day vulnerabilities, resulting in $8,256 in bug bounty rewards.

Critical Analysis

The researchers have presented a compelling approach to leveraging the power of large language models to assist in the critical task of verifying the security and correctness of complex software. By automating the generation of properties, their system has the potential to significantly streamline the auditing process and uncover vulnerabilities that might otherwise be missed.

That said, the paper does acknowledge some limitations and areas for further research. For example, the researchers note that their approach currently relies on the availability of a substantial corpus of existing human-written properties, which may not always be the case. Additionally, while the prover they designed can formally verify the generated properties, it is unclear how scalable or efficient this process is for real-world, large-scale codebases.

Another potential concern is the potential for bias or inaccuracies in the properties generated by the LLM, even with the iterative refinement process. The researchers do not explore the extent to which the generated properties might diverge from the true intent of the code, and how such discrepancies could impact the effectiveness of the verification process.

Overall, the researchers have presented an intriguing and promising approach, but further work may be needed to address these potential limitations and ensure the reliability and scalability of the PropertyGPT system, especially as it is applied to increasingly complex and mission-critical software systems.

Conclusion

This paper explores the use of large language models (LLMs) like GPT-4 to automatically generate customized properties for unknown code, with the goal of streamlining the process of verifying the security and correctness of complex software. By embedding existing human-written properties into a vector database and using LLM-based in-context learning, the researchers' PropertyGPT system was able to generate high-quality properties that were effective at detecting known vulnerabilities and uncovering new zero-day issues.

While the paper acknowledges some limitations, such as the reliance on a corpus of existing properties and the scalability of the formal verification process, the researchers have demonstrated the significant potential of leveraging powerful language models to assist with the critical task of software auditing. As LLMs continue to advance, integrating them with structured knowledge bases and tailoring them for specific domains could lead to even more powerful and reliable tools for ensuring the safety and security of complex software systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🛸

PropertyGPT: LLM-driven Formal Verification of Smart Contracts through Retrieval-Augmented Property Generation

Ye Liu, Yue Xue, Daoyuan Wu, Yuqiang Sun, Yi Li, Miaolei Shi, Yang Liu

With recent advances in large language models (LLMs), this paper explores the potential of leveraging state-of-the-art LLMs, such as GPT-4, to transfer existing human-written properties (e.g., those from Certora auditing reports) and automatically generate customized properties for unknown code. To this end, we embed existing properties into a vector database and retrieve a reference property for LLM-based in-context learning to generate a new prop- erty for a given code. While this basic process is relatively straight- forward, ensuring that the generated properties are (i) compilable, (ii) appropriate, and (iii) runtime-verifiable presents challenges. To address (i), we use the compilation and static analysis feedback as an external oracle to guide LLMs in iteratively revising the generated properties. For (ii), we consider multiple dimensions of similarity to rank the properties and employ a weighted algorithm to identify the top-K properties as the final result. For (iii), we design a dedicated prover to formally verify the correctness of the generated prop- erties. We have implemented these strategies into a novel system called PropertyGPT, with 623 human-written properties collected from 23 Certora projects. Our experiments show that PropertyGPT can generate comprehensive and high-quality properties, achieving an 80% recall compared to the ground truth. It successfully detected 26 CVEs/attack incidents out of 37 tested and also uncovered 12 zero-day vulnerabilities, resulting in $8,256 bug bounty rewards.

5/7/2024

PropTest: Automatic Property Testing for Improved Visual Programming

Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, Vicente Ordonez

Visual Programming has recently emerged as an alternative to end-to-end black-box visual reasoning models. This type of method leverages Large Language Models (LLMs) to generate the source code for an executable computer program that solves a given problem. This strategy has the advantage of offering an interpretable reasoning path and does not require finetuning a model with task-specific data. We propose PropTest, a general strategy that improves visual programming by further using an LLM to generate code that tests for visual properties in an initial round of proposed solutions. Our method generates tests for data-type consistency, output syntax, and semantic properties. PropTest achieves comparable results to state-of-the-art methods while using publicly available LLMs. This is demonstrated across different benchmarks on visual question answering and referring expression comprehension. Particularly, PropTest improves ViperGPT by obtaining 46.1% accuracy (+6.0%) on GQA using Llama3-8B and 59.5% (+8.1%) on RefCOCO+ using CodeLlama-34B.

7/24/2024

🗣️

GPTScan: Detecting Logic Vulnerabilities in Smart Contracts by Combining GPT with Program Analysis

Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Haijun Wang, Zhengzi Xu, Xiaofei Xie, Yang Liu

Smart contracts are prone to various vulnerabilities, leading to substantial financial losses over time. Current analysis tools mainly target vulnerabilities with fixed control or data-flow patterns, such as re-entrancy and integer overflow. However, a recent study on Web3 security bugs revealed that about 80% of these bugs cannot be audited by existing tools due to the lack of domain-specific property description and checking. Given recent advances in Large Language Models (LLMs), it is worth exploring how Generative Pre-training Transformer (GPT) could aid in detecting logicc vulnerabilities. In this paper, we propose GPTScan, the first tool combining GPT with static analysis for smart contract logic vulnerability detection. Instead of relying solely on GPT to identify vulnerabilities, which can lead to high false positives and is limited by GPT's pre-trained knowledge, we utilize GPT as a versatile code understanding tool. By breaking down each logic vulnerability type into scenarios and properties, GPTScan matches candidate vulnerabilities with GPT. To enhance accuracy, GPTScan further instructs GPT to intelligently recognize key variables and statements, which are then validated by static confirmation. Evaluation on diverse datasets with around 400 contract projects and 3K Solidity files shows that GPTScan achieves high precision (over 90%) for token contracts and acceptable precision (57.14%) for large projects like Web3Bugs. It effectively detects ground-truth logic vulnerabilities with a recall of over 70%, including 9 new vulnerabilities missed by human auditors. GPTScan is fast and cost-effective, taking an average of 14.39 seconds and 0.01 USD to scan per thousand lines of Solidity code. Moreover, static confirmation helps GPTScan reduce two-thirds of false positives.

5/7/2024

AuditGPT: Auditing Smart Contracts with ChatGPT

Shihao Xia, Shuai Shao, Mengting He, Tingting Yu, Linhai Song, Yiying Zhang

To govern smart contracts running on Ethereum, multiple Ethereum Request for Comment (ERC) standards have been developed, each containing a set of rules to guide the behaviors of smart contracts. Violating the ERC rules could cause serious security issues and financial loss, signifying the importance of verifying smart contracts follow ERCs. Today's practices of such verification are to either manually audit each single contract or use expert-developed, limited-scope program-analysis tools, both of which are far from being effective in identifying ERC rule violations. This paper presents a tool named AuditGPT that leverages large language models (LLMs) to automatically and comprehensively verify ERC rules against smart contracts. To build AuditGPT, we first conduct an empirical study on 222 ERC rules specified in four popular ERCs to understand their content, their security impacts, their specification in natural language, and their implementation in Solidity. Guided by the study, we construct AuditGPT by separating the large, complex auditing process into small, manageable tasks and design prompts specialized for each ERC rule type to enhance LLMs' auditing performance. In the evaluation, AuditGPT successfully pinpoints 418 ERC rule violations and only reports 18 false positives, showcasing its effectiveness and accuracy. Moreover, AuditGPT beats an auditing service provided by security experts in effectiveness, accuracy, and cost, demonstrating its advancement over state-of-the-art smart-contract auditing practices.

4/9/2024