ChIRAAG: ChatGPT Informed Rapid and Automated Assertion Generation

Read original: arXiv:2402.00093 - Published 7/1/2024 by Bhabesh Mali, Karthik Maddala, Vatsal Gupta, Sweeya Reddy, Chandan Karfa, Ramesh Karri

Overview

This paper introduces ChIRAAG, a framework built on OpenAI's GPT-4 that generates SystemVerilog Assertions (SVAs) for hardware design verification.
The framework aims to assist verification engineers in Assertion-Based Verification (ABV) by generating assertions automatically from natural-language design specifications.
ChIRAAG first breaks the specification into a standardized format, prompts the LLM to produce assertions from it, and then feeds simulation log messages back to the LLM to correct faulty assertions.

Plain English Explanation

Assertion-based verification checks that a hardware design behaves as its specification requires by testing it against assertions, which are precise statements of expected behavior. This is an important step in ensuring the reliability of hardware designs. However, writing those assertions is time-consuming and prone to human error, because an expert must read the specification and manually express each requirement as a SystemVerilog Assertion.

ChIRAAG is a new framework that aims to streamline this process by generating assertions automatically from the natural-language specification of a design. The key idea is to use the natural language understanding and generation capabilities of a large language model, GPT-4. The framework takes a plain-language requirement, such as "the counter must return to zero when reset is asserted," and translates it into an assertion that a simulation tool can check against the design.

By automating this task, ChIRAAG can help verification engineers save time and effort, allowing them to focus on the more complex aspects of verification. This could be particularly useful for large or complex hardware designs, where manually writing all the necessary assertions can be a daunting task.

Technical Explanation

The core of ChIRAAG is a prompting pipeline built on OpenAI's GPT-4. The framework first breaks the natural-language design specification down into a standardized format, then asks the LLM to generate SystemVerilog Assertions from the formatted specification. The generated assertions are run in a simulation tool against a few test cases, and the tool's log messages are fed back to the LLM automatically so that it can correct any assertions that fail.
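The paper's abstract describes breaking design specifications into a standardized format before prompting the LLM. A minimal Python sketch of that step follows. The `SpecItem` class, its field names (`signal`, `condition`, `expected`), the example counter requirements, and the prompt wording are all assumptions made for illustration, not the format used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecItem:
    """One requirement from a design specification (hypothetical schema)."""
    signal: str      # signal the requirement is about
    condition: str   # when the requirement applies
    expected: str    # what must hold when it applies


def build_prompt(module: str, items: list[SpecItem]) -> str:
    """Render the formatted requirements as a prompt asking for SVAs."""
    lines = [
        f"Write SystemVerilog assertions for module '{module}'.",
        "Requirements:",
    ]
    for i, item in enumerate(items, start=1):
        lines.append(
            f"{i}. When {item.condition}, {item.signal} must {item.expected}."
        )
    return "\n".join(lines)


# Illustrative requirements for an imaginary counter module.
spec = [
    SpecItem("count", "rst_n is low", "be 0"),
    SpecItem("count", "en is high", "increment by 1 on the next clock edge"),
]
prompt = build_prompt("counter", spec)
print(prompt)
```

Putting every requirement into the same shape keeps the prompt uniform from one design to the next, which is one plausible reason to standardize the specification before calling the LLM.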

To evaluate the effectiveness of ChIRAAG, the researchers ran experiments on designs from OpenTitan, an open-source hardware project. The LLM-generated assertions were validated by running a few test cases in a simulation tool.

According to the paper's abstract, only 27% of the raw LLM-generated assertions had errors, and these were rectified within a few iterations using feedback from the simulation log. The researchers conclude that LLMs can streamline the assertion generation process and assist engineers in it.
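The paper's abstract describes feeding simulation log messages back to the LLM until the assertions are correct. The loop below is a minimal Python sketch of that idea, not the authors' implementation. The function names, the signatures of `generate` and `simulate`, the iteration cap, and the toy stand-ins are all assumptions; a real setup would call an LLM API and an HDL simulator in their place.

```python
from typing import Callable, Optional

# Hypothetical interfaces for this sketch:
#   generate(spec, previous_log) -> assertion text
#   simulate(assertion)          -> error log, or None when the run is clean
Generate = Callable[[str, Optional[str]], str]
Simulate = Callable[[str], Optional[str]]


def refine_assertions(spec: str, generate: Generate, simulate: Simulate,
                      max_iters: int = 5) -> tuple[str, int]:
    """Regenerate assertions until the simulator reports no errors.

    Returns the final assertion text and the number of generation rounds.
    Raises RuntimeError if errors remain after max_iters rounds.
    """
    log: Optional[str] = None
    for round_no in range(1, max_iters + 1):
        assertion = generate(spec, log)  # the first round has no log yet
        log = simulate(assertion)
        if log is None:
            return assertion, round_no
    raise RuntimeError(f"still failing after {max_iters} rounds: {log}")


# Toy stand-ins so the loop can run without an LLM or a simulator.
def fake_generate(spec: str, log: Optional[str]) -> str:
    body = "assert property (@(posedge clk) en |=> count == $past(count) + 1)"
    # Omit the semicolon on the first try, add it once a log is available.
    return body if log is None else body + ";"


def fake_simulate(assertion: str) -> Optional[str]:
    # Pretend the tool only complains about a missing semicolon.
    return None if assertion.endswith(";") else "syntax error: expected ';'"


final, rounds = refine_assertions("count increments when en is high",
                                  fake_generate, fake_simulate)
print(rounds, final)
```

Capping the number of rounds is a design choice of this sketch: it prevents an endless loop when the model keeps producing assertions that the simulator rejects.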

Critical Analysis

One potential limitation of ChIRAAG is that its output depends on the underlying LLM and on the quality of the specification it is given. If the specification is incomplete or poorly structured, the framework may struggle to generate accurate assertions for parts of the design. Performance may also be affected by the complexity and ambiguity of the natural-language descriptions provided as input.

Another concern is that the framework may generate assertions that are syntactically valid and pass simulation but do not capture the designer's intended behavior. This could give false confidence in the verification process, where the assertions pass even though the design is not behaving correctly.

Overall, ChIRAAG represents an interesting and potentially valuable approach to streamlining assertion-based verification. However, further research and development may be necessary to address its limitations and ensure its robustness and reliability in real-world hardware design workflows.

Conclusion

The ChIRAAG framework introduced in this paper demonstrates the potential for large language models like GPT-4 to assist in the verification of hardware designs. By automatically generating SystemVerilog Assertions from natural-language design specifications, ChIRAAG can help engineers save time and effort in the verification process.

While the experimental results are promising, there is room for further improvement, such as ensuring that the generated assertions accurately reflect the designer's intent. Continued advances in this area could lead to more efficient and reliable verification tools, ultimately contributing to higher-quality and more dependable hardware.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ChIRAAG: ChatGPT Informed Rapid and Automated Assertion Generation

Bhabesh Mali, Karthik Maddala, Vatsal Gupta, Sweeya Reddy, Chandan Karfa, Ramesh Karri

System Verilog Assertion (SVA) formulation, a critical yet complex task, is a prerequisite in the Assertion Based Verification (ABV) process. Traditionally, SVA formulation involves expert-driven interpretation of specifications, which is time-consuming and prone to human error. Recently, LLM-informed automatic assertion generation is gaining interest. We designed a novel framework called ChIRAAG, based on OpenAI GPT4, to generate SVA from natural language specifications of a design. ChIRAAG constitutes the systematic breakdown of design specifications into a standardized format, further generating assertions from formatted specifications using LLM. Furthermore, we used a few test cases to validate the LLM-generated assertions. Automatic feedback of log messages from the simulation tool to the LLM ensures that the framework can generate correct SVAs. In our experiments, only 27% of LLM-generated raw assertions had errors, which were rectified in a few iterations based on the simulation log. Our results on OpenTitan designs show that LLMs can streamline and assist engineers in the assertion generation process, reshaping verification workflows.

7/1/2024

(Security) Assertions by Large Language Models

Rahul Kande (Texas A&M University), Hammond Pearce (University of New South Wales), Benjamin Tan (University of Calgary), Brendan Dolan-Gavitt (New York University), Shailja Thakur (New York University), Ramesh Karri (New York University), Jeyavijayan Rajendran (Texas A&M University)

The security of computer systems typically relies on a hardware root of trust. As vulnerabilities in hardware can have severe implications on a system, there is a need for techniques to support security verification activities. Assertion-based verification is a popular verification technique that involves capturing design intent in a set of assertions that can be used in formal verification or testing-based checking. However, writing security-centric assertions is a challenging task. In this work, we investigate the use of emerging large language models (LLMs) for code generation in hardware assertion generation for security, where primarily natural language prompts, such as those one would see as code comments in assertion files, are used to produce SystemVerilog assertions. We focus our attention on a popular LLM and characterize its ability to write assertions out of the box, given varying levels of detail in the prompt. We design an evaluation framework that generates a variety of prompts, and we create a benchmark suite comprising real-world hardware designs and corresponding golden reference assertions that we want to generate with the LLM.

7/11/2024

AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation

Vaishnavi Pulavarthi, Deeksha Nandal, Soham Dan, Debjit Pal

Assertions have been the de facto collateral for simulation-based and formal verification of hardware designs for over a decade. The quality of hardware verification, i.e., detection and diagnosis of corner-case design bugs, is critically dependent on the quality of the assertions. There has been a considerable amount of research leveraging a blend of data-driven statistical analysis and static analysis to generate high-quality assertions from hardware design source code and design execution trace data. Despite such concerted effort, all prior research struggles to scale to industrial-scale large designs, generates too many low-quality assertions, often fails to capture subtle and non-trivial design functionality, and does not produce any easy-to-comprehend explanations of the generated assertions to understand assertions' suitability to different downstream validation tasks. Recently, with the advent of Large-Language Models (LLMs), there has been a widespread effort to leverage prompt engineering to generate assertions. However, there is little effort to quantitatively establish the effectiveness and suitability of various LLMs for assertion generation. In this paper, we present AssertionBench, a novel benchmark to evaluate LLMs' effectiveness for assertion generation quantitatively. AssertionBench contains 100 curated Verilog hardware designs from OpenCores and formally verified assertions for each design generated from GoldMine and HARM. We use AssertionBench to compare state-of-the-art LLMs to assess their effectiveness in inferring functionally correct assertions for hardware designs. Our experiments demonstrate how LLMs perform relative to each other, the benefits of using more in-context exemplars in generating a higher fraction of functionally correct assertions, and the significant room for improvement for LLM-based assertion generators.

6/28/2024

Data is all you need: Finetuning LLMs for Chip Design via an Automated design-data augmentation framework

Kaiyan Chang, Kun Wang, Nan Yang, Ying Wang, Dantong Jin, Wenlong Zhu, Zhirong Chen, Cangyuan Li, Hao Yan, Yunhao Zhou, Zhuoliang Zhao, Yuan Cheng, Yudong Pan, Yiqi Liu, Mengdi Wang, Shengwen Liang, Yinhe Han, Huawei Li, Xiaowei Li

Recent advances in large language models have demonstrated their potential for automated generation of hardware description language (HDL) code from high-level prompts. Researchers have utilized fine-tuning to enhance the ability of these large language models (LLMs) in the field of Chip Design. However, the lack of Verilog data hinders further improvement in the quality of Verilog generation by LLMs. Additionally, the absence of a Verilog and Electronic Design Automation (EDA) script data augmentation framework significantly increases the time required to prepare the training dataset for LLM trainers. This paper proposes an automated design-data augmentation framework, which generates high-volume and high-quality natural language aligned with Verilog and EDA scripts. For Verilog generation, it translates Verilog files to an abstract syntax tree and then maps nodes to natural language with a predefined template. For Verilog repair, it uses predefined rules to generate the wrong Verilog file and then pairs EDA Tool feedback with the right and wrong Verilog file. For EDA Script generation, it uses an existing LLM (GPT-3.5) to obtain the description of the Script. To evaluate the effectiveness of our data augmentation method, we finetune Llama2-13B and Llama2-7B models using the dataset generated by our augmentation framework. The results demonstrate a significant improvement in the Verilog generation tasks with LLMs. Moreover, the accuracy of Verilog generation surpasses that of the current state-of-the-art open-source Verilog generation model, increasing from 58.8% to 70.6% with the same benchmark. Our 13B model (ChipGPT-FT) has a pass rate improvement compared with GPT-3.5 in Verilog generation and outperforms in EDA script (i.e., SiliconCompiler) generation with only 200 EDA script data.

7/11/2024