FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving

Read original: arXiv:2406.14408 - Published 6/24/2024 by Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, Xiaodan Liang

FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving

Overview

This paper introduces FVEL, an interactive formal verification environment that integrates large language models (LLMs) with theorem proving to assist users in verifying software properties.
The system allows users to interact with the LLM to describe and refine their verification goals, then uses theorem proving techniques to attempt to automatically verify the desired properties.
The paper demonstrates how FVEL can be used to efficiently verify properties of programs written in a variety of languages, including LEMUR, VHDL, and Coq.

Plain English Explanation

FVEL is a new tool that helps people verify that their software programs work correctly. It does this by combining two powerful technologies: large language models (LLMs) and theorem proving.

LLMs are very advanced AI systems that can understand and generate human language. In FVEL, the user can interact with the LLM to describe the properties they want their program to have, such as "the program should never divide by zero" or "the program should always return a positive number."

The system then uses theorem proving techniques to try to mathematically verify that the program actually has those properties. Theorem proving is a way of rigorously proving that something is true, like a mathematician proving a theorem.

By combining the language understanding of LLMs with the formal verification capabilities of theorem proving, FVEL can help users efficiently check that their programs behave as expected, even for complex software written in different programming languages like LEMUR, VHDL, and Coq. This can save time and catch bugs early in the development process.

Technical Explanation

The core of FVEL is an interactive system that allows users to communicate with a large language model (LLM) to describe the desired properties of their software programs. The user can iteratively refine their goals and specifications using natural language, and FVEL then attempts to automatically verify those properties using theorem proving techniques.

FVEL's architecture incorporates several key components:

LLM Interface: This module handles the user's natural language interactions, allowing them to express their verification goals and receive feedback from the system.
Theorem Proving Engine: This component takes the user's specifications and uses advanced reasoning algorithms, like those employed in tools like CLOVER and DeepSeek, to attempt to formally verify the desired program properties.
Knowledge Base: FVEL maintains a knowledge base of formal definitions, lemmas, and other information to assist the theorem proving process.

The key innovation of FVEL is its ability to leverage the natural language understanding capabilities of LLMs to facilitate a more intuitive and interactive formal verification workflow. By allowing users to describe their goals in plain language, FVEL lowers the barrier to entry for formal verification, making it more accessible to a wider range of developers and engineers.

Critical Analysis

The authors of the paper have demonstrated the potential of FVEL to streamline the formal verification process, but there are a few important considerations to keep in mind:

Limitations of Theorem Proving: While FVEL's theorem proving engine is powerful, there are inherent limitations to what can be automatically verified, especially for complex software systems. The paper acknowledges that human guidance and intervention may still be required in some cases.
Reliance on LLM Accuracy: FVEL's effectiveness relies heavily on the accuracy and reliability of the underlying large language model. If the LLM makes mistakes in understanding the user's intent or generating formal specifications, it could lead to incorrect verification results.
Scalability and Performance: The authors should investigate how FVEL's performance scales as the size and complexity of the verified programs increase. Efficient handling of large codebases and complex verification tasks will be crucial for practical deployment.
Formal Verification Expertise: While FVEL aims to make formal verification more accessible, users may still require some background knowledge in formal methods and theorem proving to effectively use the system. Providing sufficient guidance and educational resources will be important.

Overall, FVEL represents a promising approach to integrating large language models with theorem proving for interactive formal verification. As the authors continue to refine and expand the system, addressing these critical areas will be key to ensuring its widespread adoption and impact.

Conclusion

The FVEL system described in this paper demonstrates a novel approach to formal software verification by combining the natural language understanding of large language models with the rigorous reasoning capabilities of theorem proving. By allowing users to interactively describe their verification goals and refine them through natural language interactions, FVEL lowers the barrier to entry for formal verification and can be applied to a wide range of programming languages and domains.

While the system has some limitations and areas for further improvement, the authors have shown the potential of this approach to streamline the verification process and help developers ensure the correctness of their software more efficiently. As the field of large language models and their integration with formal methods continues to evolve, tools like FVEL may play an increasingly important role in enhancing the reliability and trustworthiness of complex software systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving

Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, Xiaodan Liang

Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.

6/24/2024

Lemur: Integrating Large Language Models in Automated Program Verification

Haoze Wu, Clark Barrett, Nina Narodytska

The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of transition rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure and demonstrate practical improvements on a set of synthetic and competition benchmarks.

4/26/2024

TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, Tong Zhang

Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving involves generating complete proofs using Large Language Models (LLMs) based on Natural Language (NL) proofs. Similar methods have shown promising results in code generation. However, most modern LLMs exhibit suboptimal performance due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data. This scarcity results in a paucity of methodologies for training LLMs and techniques to fully utilize their capabilities in composing formal proofs. To address the challenges, this paper proposes **TheoremLlama**, an end-to-end framework to train a general-purpose LLM to become a Lean4 expert. This framework encompasses NL-FL aligned dataset generation methods, training approaches for the LLM formal theorem prover, and techniques for LLM Lean4 proof writing. Using the dataset generation method, we provide *Open Bootstrapped Theorems* (OBT), an NL-FL aligned and bootstrapped dataset. A key innovation in this framework is the NL-FL bootstrapping method, where NL proofs are integrated into Lean4 code for training datasets, leveraging the NL reasoning ability of LLMs for formal reasoning. The **TheoremLlama** framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively, surpassing the GPT-4 baseline of 22.95% and 25.41%. We have also open-sourced our model checkpoints and generated dataset, and will soon make all the code publicly available.

7/4/2024

Towards Large Language Model Aided Program Refinement

Yufan Cai, Zhe Hou, Xiaokun Luan, David Miguel Sanan Baena, Yun Lin, Jun Sun, Jin Song Dong

Program refinement involves correctness-preserving transformations from formal high-level specification statements into executable programs. Traditional verification tool support for program refinement is highly interactive and lacks automation. On the other hand, the emergence of large language models (LLMs) enables automatic code generations from informal natural language specifications. However, code generated by LLMs is often unreliable. Moreover, the opaque procedure from specification to code provided by LLM is an uncontrolled black box. We propose LLM4PR, a tool that combines formal program refinement techniques with informal LLM-based methods to (1) transform the specification to preconditions and postconditions, (2) automatically build prompts based on refinement calculus, (3) interact with LLM to generate code, and finally, (4) verify that the generated code satisfies the conditions of refinement calculus, thus guaranteeing the correctness of the code. We have implemented our tool using GPT4, Coq, and Coqhammer, and evaluated it on the HumanEval and EvalPlus datasets.

6/28/2024