Laurel: Generating Dafny Assertions Using Large Language Models

Read original: arXiv:2405.16792 - Published 5/28/2024 by Eric Mugnier, Emmanuel Anaya Gonzalez, Ranjit Jhala, Nadia Polikarpova, Yuanyuan Zhou


Overview

The paper introduces Laurel, a tool that uses large language models (LLMs) to generate helper assertions for Dafny programs.
Dafny is a verification language that automates proofs by outsourcing them to an SMT solver.
The solver often needs guidance in the form of helper assertions, and writing them is a burden for the proof engineer; Laurel aims to generate these assertions automatically.
To raise the LLM's success rate, Laurel uses two domain-specific prompting techniques: it marks where the missing assertion belongs, and it supplies example assertions from the same codebase.

Plain English Explanation

Laurel is a tool that helps programmers finish proofs about their code. In Dafny, an automated solver checks that a program meets its specification, but the solver often gets stuck and needs a hint, called a helper assertion, placed at the right point in the proof. Working out which hint to write, and where to put it, takes time and expertise. Laurel asks an LLM to propose the hint. It reads the verifier's error message to find where the hint is missing, and it shows the model similar assertions that already exist in the same codebase. This can save proof engineers time and effort. The Lemur and Using Large Language Models for De-formalization of Natural systems also explore ways to leverage large language models for program verification and specification, while the Harnessing Large Language Models for Software Vulnerability Detection and Towards Large Language Models as Copilots for Theorem Proving papers investigate applying these models to other software engineering tasks.

Technical Explanation

Laurel takes a Dafny program that fails to verify and asks an LLM for the helper assertion that is missing. Two domain-specific prompting techniques improve the LLM's success rate. First, Laurel analyzes the verifier's error message to determine where the assertion is missing and inserts an assertion placeholder at that location, so the model does not have to guess the position. Second, Laurel adds example assertions from the same codebase to the prompt, selecting them with a new lemma similarity metric. The authors evaluate these techniques on a dataset of helper assertions extracted from three real-world Dafny codebases and report that Laurel generates over 50% of the required helper assertions within a few attempts. This suggests that LLMs can be a usable and affordable way to further automate practical program verification. The Towards Logically Consistent Language Models via Probabilistic paper also explores ways to make language models more reliable for tasks like program verification.
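The two prompting techniques named in the paper's abstract (a placeholder at the error location, and examples chosen by lemma similarity) can be illustrated with a minimal Python sketch. This is not Laurel's implementation. The placeholder text, the error-message format `file(line,col): Error`, the choice to put the placeholder directly before the reported line, and the function names are all assumptions made for illustration, and `difflib.SequenceMatcher` is only a generic text-similarity stand-in for the paper's own lemma similarity metric.

```python
import re
from difflib import SequenceMatcher

# Hypothetical placeholder text; the marker Laurel actually uses is not given here.
PLACEHOLDER = "<assertion> Insert assertion here </assertion>"

# Assumed verifier output format: "<file>(<line>,<col>): Error: <message>".
ERROR_RE = re.compile(r"\((\d+),(\d+)\): Error")


def error_line(verifier_output: str) -> int:
    """Return the 1-based line number of the first reported error."""
    match = ERROR_RE.search(verifier_output)
    if match is None:
        raise ValueError("no error location found in verifier output")
    return int(match.group(1))


def insert_placeholder(source: str, line_no: int) -> str:
    """Insert the placeholder on its own line just before line `line_no`,
    reusing that line's indentation (a simplifying assumption)."""
    lines = source.splitlines()
    target = lines[line_no - 1]
    indent = target[: len(target) - len(target.lstrip())]
    lines.insert(line_no - 1, indent + PLACEHOLDER)
    return "\n".join(lines)


def select_examples(target_lemma: str, candidates: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k candidate lemmas most similar to the target.
    Uses plain character-level similarity, not the paper's metric."""
    def score(body: str) -> float:
        return SequenceMatcher(None, target_lemma, body, autojunk=False).ratio()

    ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
    return ranked[:k]


# Illustrative Dafny lemma and verifier output, both made up for this sketch.
LEMMA = """lemma SumAppend(a: seq<int>, b: seq<int>)
  ensures Sum(a + b) == Sum(a) + Sum(b)
{
  if a != [] {
    SumAppend(a[1..], b);
  }
}"""
SAMPLE_ERROR = "example.dfy(5,4): Error: assertion might not hold"

if __name__ == "__main__":
    print(insert_placeholder(LEMMA, error_line(SAMPLE_ERROR)))
```

In a full pipeline, the annotated source and the selected example lemmas would be combined into one prompt, and each assertion the model proposes would be checked by running the verifier again.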

Critical Analysis

Several limitations are worth noting. First, Laurel relies on the verifier's error message to locate the missing assertion, so it is only as precise as the location the verifier reports. Second, the example-selection technique assumes the codebase already contains helper assertions similar to the one needed, which may not hold for a new or small project. Third, a success rate of just over 50% leaves a substantial share of assertions to be written by hand, and Laurel may struggle with complex assertions that require deeper reasoning about program state or control flow. Finally, the evaluation draws on three codebases; testing on a wider range of programs and verification tasks would help validate its usefulness in practice.

Conclusion

Laurel demonstrates the potential of large language models to assist with program verification. By generating the helper assertions that Dafny's solver needs, guided by the verifier's error messages and by similar assertions from the same codebase, Laurel reduces the manual effort of completing a proof. This type of AI-assisted verification could become an increasingly important tool for software developers, helping to catch bugs and ensure the correctness of complex systems. As language models continue to advance, we can expect to see more sophisticated applications in this domain.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Laurel: Generating Dafny Assertions Using Large Language Models

Eric Mugnier, Emmanuel Anaya Gonzalez, Ranjit Jhala, Nadia Polikarpova, Yuanyuan Zhou

Dafny is a popular verification language, which automates proofs by outsourcing them to an SMT solver. This automation is not perfect, however, and the solver often requires guidance in the form of helper assertions creating a burden for the proof engineer. In this paper, we propose Laurel, a tool that uses large language models (LLMs) to automatically generate helper assertions for Dafny programs. To improve the success rate of LLMs in this task, we design two domain-specific prompting techniques. First, we help the LLM determine the location of the missing assertion by analyzing the verifier's error message and inserting an assertion placeholder at that location. Second, we provide the LLM with example assertions from the same codebase, which we select based on a new lemma similarity metric. We evaluate our techniques on a dataset of helper assertions we extracted from three real-world Dafny codebases. Our evaluation shows that Laurel is able to generate over 50% of the required helper assertions given only a few attempts, making LLMs a usable and affordable tool to further automate practical program verification.

5/28/2024

(Security) Assertions by Large Language Models

Rahul Kande (Texas A&M University), Hammond Pearce (University of New South Wales), Benjamin Tan (University of Calgary), Brendan Dolan-Gavitt (New York University), Shailja Thakur (New York University), Ramesh Karri (New York University), Jeyavijayan Rajendran (Texas A&M University)

The security of computer systems typically relies on a hardware root of trust. As vulnerabilities in hardware can have severe implications on a system, there is a need for techniques to support security verification activities. Assertion-based verification is a popular verification technique that involves capturing design intent in a set of assertions that can be used in formal verification or testing-based checking. However, writing security-centric assertions is a challenging task. In this work, we investigate the use of emerging large language models (LLMs) for code generation in hardware assertion generation for security, where primarily natural language prompts, such as those one would see as code comments in assertion files, are used to produce SystemVerilog assertions. We focus our attention on a popular LLM and characterize its ability to write assertions out of the box, given varying levels of detail in the prompt. We design an evaluation framework that generates a variety of prompts, and we create a benchmark suite comprising real-world hardware designs and corresponding golden reference assertions that we want to generate with the LLM.

7/11/2024

DafnyBench: A Benchmark for Formal Software Verification

Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, Max Tegmark

We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.

6/13/2024

Lemur: Integrating Large Language Models in Automated Program Verification

Haoze Wu, Clark Barrett, Nina Narodytska

The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of transition rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure and demonstrate practical improvements on a set of synthetic and competition benchmarks.

4/26/2024