Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

2406.03847

Published 6/10/2024 by Huaiyuan Ying, Zijian Wu, Yihan Geng, Jiayu Wang, Dahua Lin, Kai Chen

Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

Abstract

Large language models have demonstrated impressive capabilities across various natural language processing tasks, especially in solving mathematical problems. However, large language models are not good at math theorem proving using formal languages like Lean. A significant challenge in this area is the scarcity of training data available in these formal languages. To address this issue, we propose a novel pipeline that iteratively generates and filters synthetic data to translate natural language mathematical problems into Lean 4 statements, and vice versa. Our results indicate that the synthetic data pipeline can provide useful training data and improve the performance of LLMs in translating and understanding complex mathematical problems and proofs. Our final dataset contains about 57K formal-informal question pairs along with searched proof from the math contest forum and 21 new IMO questions. We open-source our code at https://github.com/InternLM/InternLM-Math and our data at https://huggingface.co/datasets/InternLM/Lean-Workbook.

Create account to get full access

Overview

This paper presents the "Lean Workbook", a large-scale collection of math problems formalized from natural language using the Lean theorem prover.
The Lean Workbook aims to serve as a benchmark for evaluating the mathematical reasoning capabilities of large language models.
The paper describes the process of formalizing the problems, the structure and contents of the Lean Workbook, and preliminary experiments using the workbook.

Plain English Explanation

The "Lean Workbook" is a set of math problems that have been translated into a formal, computer-readable format using the Lean theorem prover. Lean is a software tool that allows mathematicians and computer scientists to precisely define and reason about mathematical concepts and proofs.

The Lean Workbook was created by taking a large number of math word problems, typically the kind you might find in a textbook or on a standardized test, and formalizing them using Lean's language and logic. This means the problems were translated from natural language (like English) into a formal mathematical notation that a computer can understand and work with.

The goal of the Lean Workbook is to provide a benchmark or test set for evaluating the mathematical reasoning capabilities of large language models, which are AI systems trained on vast amounts of text data. Researchers can use the Lean Workbook to see how well these language models can understand, solve, and reason about mathematical problems, which is an important capability for many real-world applications.

By formalizing the problems in Lean, the researchers have created a precise, machine-readable version of the problems that can be used to test and measure the performance of AI systems in a rigorous way. This is an important step in advancing the state of the art in AI-powered mathematical reasoning.

Technical Explanation

The Lean Workbook is a large collection of math problems that have been formalized using the Lean theorem prover. The Lean theorem prover is a software tool that allows users to precisely define mathematical concepts and prove theorems in a formal, computer-readable way.

To create the Lean Workbook, the researchers took a set of natural language math problems, typically the kind found in textbooks or on standardized tests, and translated them into Lean's formal language. This process involved breaking down the problems, identifying the relevant mathematical concepts and logical structure, and encoding them using Lean's syntax and type system.

The resulting Lean Workbook contains over 13,000 individual problem statements, covering a wide range of mathematical topics such as algebra, geometry, calculus, and more. Each problem is accompanied by a formal Lean statement that precisely defines the problem, as well as supporting lemmas and theorems needed to solve it.

The researchers conducted preliminary experiments using the Lean Workbook to evaluate the mathematical reasoning capabilities of large language models such as GPT-3. They found that while these models showed some ability to understand and reason about the math problems, their performance was still limited compared to human experts.

Critical Analysis

The Lean Workbook represents an important step forward in the evaluation of AI systems for mathematical reasoning. By formalizing a large set of math problems in a precise, machine-readable format, the researchers have created a valuable benchmark that can be used to rigorously test and measure the capabilities of language models and other AI systems.

However, the paper also acknowledges several limitations and challenges. First, the process of formalizing the natural language problems into Lean is complex and time-consuming, requiring significant mathematical and technical expertise. This could make it difficult to scale the Lean Workbook to an even larger set of problems.

Additionally, the researchers note that the current Lean Workbook is still relatively narrow in scope, focusing primarily on problems that can be solved using basic mathematical concepts and logic. More advanced problems involving complex reasoning, creativity, or domain-specific knowledge may require further extensions to the workbook.

There is also the question of how well the Lean Workbook's formalization of the problems captures the full semantics and nuance of the original natural language statements. Some of the richness and ambiguity of human language may be lost in the translation to formal logic.

Overall, the Lean Workbook represents an important contribution to the field of AI-powered mathematical reasoning, but further research and development will be needed to fully realize its potential as a benchmark for evaluating the mathematical capabilities of large language models and other AI systems.

Conclusion

The "Lean Workbook" presented in this paper is a significant step forward in the effort to create rigorous, machine-readable benchmarks for evaluating the mathematical reasoning capabilities of large language models and other AI systems. By formalizing a large set of math problems in the Lean theorem prover, the researchers have provided a valuable tool for testing and measuring the performance of these systems in a precise and systematic way.

While the current Lean Workbook has some limitations, such as the complexity of the formalization process and the relatively narrow scope of the problems, it represents an important foundation for future research and development in this area. As AI systems continue to advance in their mathematical reasoning abilities, tools like the Lean Workbook will be crucial for driving progress and ensuring these systems can be applied effectively to real-world problems that require strong mathematical skills.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

📊

DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, Xiaodan Liang

Proof assistants like Lean have revolutionized mathematical proof verification, ensuring high accuracy and reliability. Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem proving is hindered by a lack of training data. To address this issue, we introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems. This approach involves translating natural language problems into formal statements, filtering out low-quality statements, and generating proofs to create synthetic data. After fine-tuning the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs, our model achieved whole-proof generation accuracies of 46.3% with 64 samples and 52% cumulatively on the Lean 4 miniF2F test, surpassing the baseline GPT-4 at 23.0% with 64 samples and a tree search reinforcement learning method at 41.0%. Additionally, our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any. These results demonstrate the potential of leveraging large-scale synthetic data to enhance theorem-proving capabilities in LLMs. Both the synthetic dataset and the model will be made available to facilitate further research in this promising field.

5/24/2024

cs.AI

Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks

Avinash Anand, Mohit Gupta, Kritarth Prasad, Navya Singla, Sanjana Sanjeev, Jatin Kumar, Adarsh Raj Shivam, Rajiv Ratn Shah

The rapid progress in the field of natural language processing (NLP) systems and the expansion of large language models (LLMs) have opened up numerous opportunities in the field of education and instructional methods. These advancements offer the potential for tailored learning experiences and immediate feedback, all delivered through accessible and cost-effective services. One notable application area for this technological advancement is in the realm of solving mathematical problems. Mathematical problem-solving not only requires the ability to decipher complex problem statements but also the skill to perform precise arithmetic calculations at each step of the problem-solving process. However, the evaluation of the arithmetic capabilities of large language models remains an area that has received relatively little attention. In response, we introduce an extensive mathematics dataset called MathQuest sourced from the 11th and 12th standard Mathematics NCERT textbooks. This dataset encompasses mathematical challenges of varying complexity and covers a wide range of mathematical concepts. Utilizing this dataset, we conduct fine-tuning experiments with three prominent LLMs: LLaMA-2, WizardMath, and MAmmoTH. These fine-tuned models serve as benchmarks for evaluating their performance on our dataset. Our experiments reveal that among the three models, MAmmoTH-13B emerges as the most proficient, achieving the highest level of competence in solving the presented mathematical problems. Consequently, MAmmoTH-13B establishes itself as a robust and dependable benchmark for addressing NCERT mathematics problems.

4/23/2024

cs.CL cs.AI

New!TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, Tong Zhang

Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving involves generating complete proofs using Large Language Models (LLMs) based on Natural Language (NL) proofs. Similar methods have shown promising results in code generation. However, most modern LLMs exhibit suboptimal performance due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data. This scarcity results in a paucity of methodologies for training LLMs and techniques to fully utilize their capabilities in composing formal proofs. To address the challenges, this paper proposes **TheoremLlama**, an end-to-end framework to train a general-purpose LLM to become a Lean4 expert. This framework encompasses NL-FL aligned dataset generation methods, training approaches for the LLM formal theorem prover, and techniques for LLM Lean4 proof writing. Using the dataset generation method, we provide *Open Bootstrapped Theorems* (OBT), an NL-FL aligned and bootstrapped dataset. A key innovation in this framework is the NL-FL bootstrapping method, where NL proofs are integrated into Lean4 code for training datasets, leveraging the NL reasoning ability of LLMs for formal reasoning. The **TheoremLlama** framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively, surpassing the GPT-4 baseline of 22.95% and 25.41%. We have also open-sourced our model checkpoints and generated dataset, and will soon make all the code publicly available.

7/4/2024

cs.FL cs.AI

Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data

Haolong Li, Yu Ma, Yinqi Zhang, Chen Ye, Jie Chen

Large Language Models (LLMs) have shown excellent performance in language understanding, text generation, code synthesis, and many other tasks, while they still struggle in complex multi-step reasoning problems, such as mathematical reasoning. In this paper, through a newly proposed arithmetical puzzle problem, we show that the model can perform well on multi-step reasoning tasks via fine-tuning on high-quality synthetic data. Experimental results with the open-llama-3B model on three different test datasets show that not only the model can reach a zero-shot pass@1 at 0.44 on the in-domain dataset, it also demonstrates certain generalization capabilities on the out-of-domain datasets. Specifically, this paper has designed two out-of-domain datasets in the form of extending the numerical range and the composing components of the arithmetical puzzle problem separately. The fine-tuned models have shown encouraging performance on these two far more difficult tasks with the zero-shot pass@1 at 0.33 and 0.35, respectively.

6/5/2024

cs.CL