DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

2405.14333

Published 5/24/2024 by Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, Xiaodan Liang

cs.AI

📊

Abstract

Proof assistants like Lean have revolutionized mathematical proof verification, ensuring high accuracy and reliability. Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem proving is hindered by a lack of training data. To address this issue, we introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems. This approach involves translating natural language problems into formal statements, filtering out low-quality statements, and generating proofs to create synthetic data. After fine-tuning the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs, our model achieved whole-proof generation accuracies of 46.3% with 64 samples and 52% cumulatively on the Lean 4 miniF2F test, surpassing the baseline GPT-4 at 23.0% with 64 samples and a tree search reinforcement learning method at 41.0%. Additionally, our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any. These results demonstrate the potential of leveraging large-scale synthetic data to enhance theorem-proving capabilities in LLMs. Both the synthetic dataset and the model will be made available to facilitate further research in this promising field.

Create account to get full access

Overview

Proof assistants like Lean have greatly improved the accuracy and reliability of mathematical proof verification.
While large language models (LLMs) show promise in mathematical reasoning, they struggle with formal theorem proving due to a lack of training data.
This paper introduces an approach to generate extensive Lean 4 proof data from high-school and undergraduate-level math competition problems.
The approach involves translating natural language problems into formal statements, filtering out low-quality statements, and generating proofs to create synthetic data.
After fine-tuning the DeepSeekMath 7B model on this synthetic dataset, the model achieved improved whole-proof generation accuracies on the Lean 4 miniF2F and FIMO benchmarks, outperforming other baselines.

Plain English Explanation

Computers have become incredibly adept at verifying the accuracy of mathematical proofs, thanks to specialized "proof assistant" software like Lean. However, large language models (LLMs) - the powerful AI systems that can understand and generate human-like text - have struggled to match this level of formal mathematical reasoning. The key issue is that LLMs lack sufficient training data in the specialized language and techniques of mathematical proofs.

To address this, the researchers developed a clever way to generate a large synthetic dataset of Lean 4 proofs. They started with high-school and college-level math competition problems, translated them into formal mathematical statements, filtered out low-quality ones, and then automatically generated proofs for the remaining statements. This resulted in a dataset of over 8 million formal statements with proofs.

The researchers then fine-tuned their DeepSeekMath 7B language model on this synthetic dataset. The results were impressive - the fine-tuned model outperformed other state-of-the-art approaches on standard benchmarks for generating complete mathematical proofs. It even managed to solve a handful of the challenging Lean 4 Formalized International Mathematical Olympiad (FIMO) problems, which previous models had failed.

This work demonstrates the power of leveraging large-scale synthetic training data to enhance the theorem-proving capabilities of LLMs. By creating a custom dataset tailored to the unique challenges of formal mathematics, the researchers were able to significantly boost the performance of their AI system. Both the dataset and the improved model will be made publicly available, paving the way for further advancements in this exciting field.

Technical Explanation

The paper introduces a novel approach to generate extensive Lean 4 proof data from high-school and undergraduate-level mathematical competition problems. This is motivated by the fact that while large language models (LLMs) have shown promise in mathematical reasoning, their advancement in formal theorem proving has been hindered by a lack of training data.

The key steps of the approach are:

Translating natural language problems: The researchers start by translating natural language math competition problems into formal Lean 4 statements.
Filtering low-quality statements: They then filter out low-quality or ill-formed statements using a combination of manual and automated methods.
Generating proofs: Finally, they generate proofs for the remaining high-quality statements using a combination of automated theorem proving and manual intervention.

This process results in a synthetic dataset of over 8 million formal statements with proofs, which the researchers use to fine-tune their DeepSeekMath 7B model.

When evaluated on the Lean 4 miniF2F and FIMO benchmarks, the fine-tuned DeepSeekMath model achieves impressive results:

On the miniF2F test, it achieved whole-proof generation accuracies of 46.3% with 64 samples and 52% cumulatively, outperforming the baseline GPT-4 at 23.0% with 64 samples and a tree search reinforcement learning method at 41.0%.
On the FIMO benchmark, the model successfully proved 5 out of 148 problems, while GPT-4 failed to prove any.

These results demonstrate the potential of leveraging large-scale synthetic data to enhance the theorem-proving capabilities of LLMs. The researchers make both the synthetic dataset and the improved DeepSeekMath model publicly available to facilitate further research in this promising field.

Critical Analysis

The researchers have made a significant contribution by addressing a key challenge in the field of mathematical reasoning with large language models - the lack of high-quality training data for formal proof verification and generation.

However, the paper does not provide a detailed analysis of the types of mathematical problems or proof techniques covered in the synthetic dataset. It would be useful to know the distribution of topics, difficulty levels, and proof strategies represented, as this could inform the strengths and limitations of the approach.

Additionally, while the model's performance on the Lean 4 benchmarks is impressive, it is still far from human-level mastery of formal theorem proving. The researchers acknowledge that the model only solved a small fraction of the FIMO problems, and further work is needed to expand its capabilities.

One potential concern is the reliance on automated proof generation, which may introduce biases or errors into the synthetic dataset. The researchers do mention manual intervention, but it is unclear how extensive this process was and how it affected the quality of the proofs.

Overall, this work represents an important step forward in the quest to develop LLMs that can excel at formal mathematical reasoning. The availability of the synthetic dataset and the improved DeepSeekMath model will undoubtedly spur further research and innovation in this exciting field.

Conclusion

This paper presents a novel approach to generate large-scale synthetic data for training large language models in the domain of formal mathematical proof verification and generation. By translating natural language math competition problems into formal Lean 4 statements and automatically generating proofs, the researchers were able to create a dataset of over 8 million formal statements with proofs.

When the researchers fine-tuned their DeepSeekMath 7B model on this synthetic dataset, it demonstrated significant improvements in whole-proof generation accuracy on standard benchmarks, outperforming other state-of-the-art approaches. The model even managed to solve a handful of challenging Lean 4 Formalized International Mathematical Olympiad (FIMO) problems, showcasing its potential in formal theorem proving.

This work highlights the power of leveraging large-scale synthetic data to enhance the capabilities of large language models in specialized domains like mathematics. By creating a custom dataset tailored to the unique challenges of formal proof verification and generation, the researchers were able to push the boundaries of what is possible with these powerful AI systems. Both the dataset and the improved DeepSeekMath model will be made publicly available, paving the way for further advancements in this exciting field of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

Huaiyuan Ying, Zijian Wu, Yihan Geng, Jiayu Wang, Dahua Lin, Kai Chen

Large language models have demonstrated impressive capabilities across various natural language processing tasks, especially in solving mathematical problems. However, large language models are not good at math theorem proving using formal languages like Lean. A significant challenge in this area is the scarcity of training data available in these formal languages. To address this issue, we propose a novel pipeline that iteratively generates and filters synthetic data to translate natural language mathematical problems into Lean 4 statements, and vice versa. Our results indicate that the synthetic data pipeline can provide useful training data and improve the performance of LLMs in translating and understanding complex mathematical problems and proofs. Our final dataset contains about 57K formal-informal question pairs along with searched proof from the math contest forum and 21 new IMO questions. We open-source our code at https://github.com/InternLM/InternLM-Math and our data at https://huggingface.co/datasets/InternLM/Lean-Workbook.

6/10/2024

cs.CL

💬

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, Daya Guo

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.

4/30/2024

cs.CL cs.AI cs.LG

A Survey on Deep Learning for Theorem Proving

Zhaoyu Li, Jialiang Sun, Logan Murphy, Qidong Su, Zenan Li, Xian Zhang, Kaiyu Yang, Xujie Si

Theorem proving is a fundamental aspect of mathematics, spanning from informal reasoning in mathematical language to rigorous derivations in formal systems. In recent years, the advancement of deep learning, especially the emergence of large language models, has sparked a notable surge of research exploring these techniques to enhance the process of theorem proving. This paper presents a pioneering comprehensive survey of deep learning for theorem proving by offering i) a thorough review of existing approaches across various tasks such as autoformalization, premise selection, proofstep generation, and proof search; ii) a meticulous summary of available datasets and strategies for data generation; iii) a detailed analysis of evaluation metrics and the performance of state-of-the-art; and iv) a critical discussion on the persistent challenges and the promising avenues for future exploration. Our survey aims to serve as a foundational reference for deep learning approaches in theorem proving, seeking to catalyze further research endeavors in this rapidly growing field.

4/16/2024

cs.AI

Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data

Haolong Li, Yu Ma, Yinqi Zhang, Chen Ye, Jie Chen

Large Language Models (LLMs) have shown excellent performance in language understanding, text generation, code synthesis, and many other tasks, while they still struggle in complex multi-step reasoning problems, such as mathematical reasoning. In this paper, through a newly proposed arithmetical puzzle problem, we show that the model can perform well on multi-step reasoning tasks via fine-tuning on high-quality synthetic data. Experimental results with the open-llama-3B model on three different test datasets show that not only the model can reach a zero-shot pass@1 at 0.44 on the in-domain dataset, it also demonstrates certain generalization capabilities on the out-of-domain datasets. Specifically, this paper has designed two out-of-domain datasets in the form of extending the numerical range and the composing components of the arithmetical puzzle problem separately. The fine-tuned models have shown encouraging performance on these two far more difficult tasks with the zero-shot pass@1 at 0.33 and 0.35, respectively.

6/5/2024

cs.CL