Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages

Read original: arXiv:2406.09757 - Published 6/17/2024 by Shuvendu K. Lahiri

💬

Overview

This paper explores using large language models (LLMs) to help users formalize their intent when working with verification-aware programming languages like Dafny.
The goal is to make it easier for users to write formal specifications and proofs by automating parts of the process.
The authors evaluate different LLM-based approaches on a benchmark dataset and find that their method, called LEMUR, outperforms other techniques.

Plain English Explanation

Writing code that can be automatically verified for correctness is an important but challenging task. Verification-aware languages like Dafny require users to specify formal properties about their program's behavior, which can be tedious and error-prone.

This paper explores using large language models (LLMs) - AI systems trained on vast amounts of text data - to help users with this process. The idea is that an LLM could take a user's natural language description of their intent and automatically generate the corresponding formal specifications.

The researchers developed a system called LEMUR that integrates LLMs into the verification workflow. They tested it on a benchmark dataset and found that it outperformed other approaches, making it easier for users to write correct programs in verification-aware languages.

By reducing the burden of writing formal specifications, this work has the potential to make formal verification more accessible and widely adopted, improving the quality and reliability of software systems.

Technical Explanation

The paper introduces a new system called LEMUR that leverages large language models (LLMs) to assist users in formalizing their intent when working with verification-aware programming languages like Dafny.

LEMUR works by first having the user provide a natural language description of their program's intended behavior. The system then uses an LLM to generate corresponding formal specifications, including pre- and post-conditions, loop invariants, and other annotations required by the verification system.

To evaluate LEMUR, the authors created a new benchmark dataset called DafnyBench with a variety of verification tasks. They compared LEMUR's performance to other LLM-based and rule-based approaches, and found that LEMUR outperformed the baselines, generating more accurate formal specifications.

The paper also includes an analysis of LEMUR's strengths and limitations. For example, the authors found that LEMUR struggled with tasks that required more complex reasoning or domain-specific knowledge. They suggest that future work could explore ways to better integrate LLMs with the verification process, or to leverage additional input modalities beyond just natural language.

Critical Analysis

The paper presents a promising approach for using large language models to assist with formal verification, but there are a few important caveats to consider:

First, the authors acknowledge that LEMUR's performance degrades on more complex verification tasks that require deeper reasoning or domain-specific knowledge. This suggests that while LLMs can be useful for automating certain aspects of formalization, there may be limits to their capabilities when it comes to more sophisticated verification problems.

Additionally, the benchmark dataset used to evaluate LEMUR, DafnyBench, is relatively small and may not capture the full diversity of real-world verification challenges. Further testing on larger and more diverse datasets would be helpful to better understand LEMUR's strengths and weaknesses.

It's also worth noting that the paper does not address potential issues around the reliability and trustworthiness of LLM-generated formal specifications. If these specifications are used as the basis for formal verification, any errors or biases in the LLM's outputs could lead to incorrect proofs and unsound software. Addressing these safety and reliability concerns will be an important area for future research.

Overall, the paper makes a valuable contribution by demonstrating the potential of LLMs to streamline the formalization process. However, more work is needed to fully realize the benefits of this approach while mitigating its limitations and risks.

Conclusion

This paper explores the use of large language models (LLMs) to assist users in formalizing their intent when working with verification-aware programming languages like Dafny. The researchers developed a system called LEMUR that integrates LLMs into the verification workflow, allowing users to provide natural language descriptions of their program's behavior and automatically generate corresponding formal specifications.

Evaluating LEMUR on a new benchmark dataset called DafnyBench, the authors found that their approach outperformed other LLM-based and rule-based techniques, suggesting that LLMs can be a useful tool for automating parts of the formalization process.

While this work represents an important step forward, the paper also highlights some of the limitations of LLM-driven formalization, such as struggles with complex reasoning and domain-specific knowledge. Addressing these challenges and ensuring the reliability of LLM-generated formal specifications will be important areas for future research.

Overall, this paper demonstrates the potential of leveraging large language models to make formal verification more accessible and user-friendly, which could have significant implications for improving the quality and reliability of software systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

💬

Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages

Shuvendu K. Lahiri

Verification-aware programming languages such as Dafny and F* provide means to formally specify and prove properties of programs. Although the problem of checking an implementation against a specification can be defined mechanically, there is no algorithmic way of ensuring the correctness of the user-intent formalization for programs -- that a specification adheres to the user's intent behind the program. The intent or requirement is expressed informally in natural language and the specification is a formal artefact. The advent of large language models (LLMs) has made strides bridging the gap between informal intent and formal program implementations recently, driven in large parts due to benchmarks and automated metrics for evaluation. Recent work has proposed evaluating {it user-intent formalization} problem for mainstream programming languages~cite{endres-fse24}. However, such an approach does not readily extend to verification-aware languages that support rich specifications (containing quantifiers and ghost variables) that cannot be evaluated through dynamic execution. Previous work also required generating program mutants using LLMs to create the benchmark. We advocate an alternate approach of {it symbolically testing specifications} to provide an intuitive metric for evaluating the quality of specifications for verification-aware languages. We demonstrate that our automated metric agrees closely with mostly GPT-4 generated and human-labeled dataset of roughly 150 Dafny specifications for the popular MBPP code-generation benchmark, yet demonstrates cases where the human labeling is not perfect. We believe our work provides a stepping stone to enable the establishment of a benchmark and research agenda for the problem of user-intent formalization for programs.

6/17/2024

Towards Large Language Model Aided Program Refinement

Yufan Cai, Zhe Hou, Xiaokun Luan, David Miguel Sanan Baena, Yun Lin, Jun Sun, Jin Song Dong

Program refinement involves correctness-preserving transformations from formal high-level specification statements into executable programs. Traditional verification tool support for program refinement is highly interactive and lacks automation. On the other hand, the emergence of large language models (LLMs) enables automatic code generations from informal natural language specifications. However, code generated by LLMs is often unreliable. Moreover, the opaque procedure from specification to code provided by LLM is an uncontrolled black box. We propose LLM4PR, a tool that combines formal program refinement techniques with informal LLM-based methods to (1) transform the specification to preconditions and postconditions, (2) automatically build prompts based on refinement calculus, (3) interact with LLM to generate code, and finally, (4) verify that the generated code satisfies the conditions of refinement calculus, thus guaranteeing the correctness of the code. We have implemented our tool using GPT4, Coq, and Coqhammer, and evaluated it on the HumanEval and EvalPlus datasets.

6/28/2024

Lemur: Integrating Large Language Models in Automated Program Verification

Haoze Wu, Clark Barrett, Nina Narodytska

The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of transition rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure and demonstrate practical improvements on a set of synthetic and competition benchmarks.

4/26/2024

Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?

Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, Shuvendu K. Lahiri

Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a programs intent. However, there is typically no guarantee that a programs implementation and natural language documentation are aligned. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. In practice, however, this information is often underutilized due to the inherent ambiguity of natural language which makes natural language intent challenging to check programmatically. The emergent abilities of Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent to programmatically checkable assertions. However, it is unclear if LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent. Additionally, it is unclear if such translation could be useful in practice. In this paper, we describe nl2postcond, the problem of leveraging LLMs for transforming informal natural language to formal method postconditions, expressed as program assertions. We introduce and validate metrics to measure and compare different nl2postcond approaches, using the correctness and discriminative power of generated postconditions. We then use qualitative and quantitative methods to assess the quality of nl2postcond postconditions, finding that they are generally correct and able to discriminate incorrect code. Finally, we find that nl2postcond via LLMs has the potential to be helpful in practice; nl2postcond generated postconditions were able to catch 64 real-world historical bugs from Defects4J.

4/17/2024