JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models

arXiv:2405.14365

Published 5/24/2024 by Kun Zhou, Beichen Zhang, Jiapeng Wang, Zhipeng Chen, Wayne Xin Zhao, Jing Sha, Zhichao Sheng, Shijin Wang, Ji-Rong Wen

cs.CL cs.AI

Abstract

Mathematical reasoning is an important capability of large language models (LLMs) for real-world applications. To enhance this capability, existing work either collects large-scale math-related texts for pre-training, or relies on stronger LLMs (e.g., GPT-4) to synthesize massive math problems. Both types of work generally lead to large costs in training or synthesis. To reduce the cost, based on open-source available texts, we propose an efficient way that trains a small LLM for math problem synthesis, to efficiently generate sufficient high-quality pre-training data. To achieve it, we create a dataset using GPT-4 to distill its data synthesis capability into the small LLM. Concretely, we craft a set of prompts based on human education stages to guide GPT-4, to synthesize problems covering diverse math knowledge and difficulty levels. Besides, we adopt the gradient-based influence estimation method to select the most valuable math-related texts. Both are fed into GPT-4 for creating the knowledge distillation dataset to train the small LLM. We leverage it to synthesize 6 million math problems for pre-training our JiuZhang3.0 model, which only needs to invoke GPT-4 API 9.3k times and pre-train on 4.6B data. Experimental results have shown that JiuZhang3.0 achieves state-of-the-art performance on several mathematical reasoning datasets, under both natural language reasoning and tool manipulation settings. Our code and data will be publicly released at https://github.com/RUCAIBox/JiuZhang3.0.

Overview

This paper proposes an efficient way to train a small language model that synthesizes high-quality math problems, which are then used to pre-train the JiuZhang3.0 math reasoning model.
The key ideas are: 1) using GPT-4 to distill its math problem synthesis capability into a small model, and 2) using gradient-based influence estimation to select the most valuable math-related texts as source material for the distillation dataset.
The resulting model, JiuZhang3.0, achieves state-of-the-art performance on various math reasoning benchmarks.

Plain English Explanation

Large language models (LLMs) like GPT-4 have shown impressive capabilities in mathematical reasoning, which is important for real-world applications. However, improving this capability is generally costly: existing work either collects large-scale math-related texts for pre-training or relies on a stronger LLM such as GPT-4 to synthesize massive numbers of math problems.

To reduce these costs, the researchers propose an efficient approach. They first use GPT-4 to build a knowledge distillation dataset of high-quality math problems covering diverse knowledge and difficulty levels. To do this, they craft prompts based on human education stages to guide GPT-4's problem generation, and they use a method called gradient-based influence estimation to select the most valuable math-related texts from open-source sources. The prompts and the selected texts are fed into GPT-4 together, which requires only about 9.3k API calls.
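
The prompt-crafting step can be pictured with a minimal sketch. The stage names, the template wording, and the `build_prompts` helper below are all hypothetical illustrations; the prompts actually used in the paper may look quite different.

```python
# Hypothetical education stages and prompt wording, chosen for illustration only.
EDUCATION_STAGES = ["grade school", "middle school", "high school", "college"]

PROMPT_TEMPLATE = (
    "You are a math teacher for {stage} students.\n"
    "Using the reference text below, write one math problem suited to that "
    "level, followed by a step-by-step solution.\n\n"
    "Reference text:\n{text}\n"
)


def build_prompts(text: str) -> dict:
    """Return one synthesis prompt per education stage for a single source text."""
    return {
        stage: PROMPT_TEMPLATE.format(stage=stage, text=text)
        for stage in EDUCATION_STAGES
    }


prompts = build_prompts("The sum of the interior angles of a triangle is 180 degrees.")
for stage, prompt in prompts.items():
    print(f"--- {stage} ---")
    print(prompt)
```

Pairing the same source text with several stages is one plausible way to obtain problems at different difficulty levels from a single document.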

The researchers then use this dataset to train a small LLM, distilling GPT-4's data synthesis capability into a cheaper model. This small model synthesizes 6 million math problems, which are used to pre-train JiuZhang3.0. This is much less costly than having GPT-4 synthesize every problem directly.
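
To make the saving concrete, the two figures quoted in the abstract can be compared directly. This is only back-of-the-envelope arithmetic on rounded numbers, not a cost model from the paper.

```python
gpt4_calls = 9_300        # "9.3k" GPT-4 API invocations, as stated in the abstract
synthesized = 6_000_000   # problems generated by the small synthesis model

# Number of synthesized problems obtained per GPT-4 call spent on distillation data.
problems_per_call = synthesized / gpt4_calls
print(f"about {problems_per_call:.0f} synthesized problems per GPT-4 call")
```

In other words, each GPT-4 call spent on the distillation dataset corresponds to several hundred problems produced later by the small model.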

The end result is that JiuZhang3.0 achieves state-of-the-art performance on various math reasoning benchmarks, including both natural language reasoning and tool manipulation settings. This demonstrates the effectiveness of the researchers' approach in enhancing mathematical reasoning capabilities in language models.

Technical Explanation

The key technical contributions of this paper are:

Dataset Synthesis: The researchers use GPT-4 to synthesize a knowledge distillation dataset of high-quality math problems. They craft a set of prompts based on human education stages to guide GPT-4's problem generation, covering diverse math knowledge and difficulty levels.
Gradient-based Influence Estimation: To select the most valuable math-related texts, the researchers adopt the gradient-based influence estimation method. The selected texts are fed into GPT-4 together with the prompts to create the knowledge distillation dataset.
Knowledge Distillation: The researchers train a small LLM on the knowledge distillation dataset, distilling GPT-4's math problem synthesis capability into a more efficient model. This small model is then used to synthesize 6 million math problems for pre-training JiuZhang3.0, which requires only 9.3k GPT-4 API calls and pre-training on 4.6B data.
Evaluation: The researchers evaluate JiuZhang3.0 on several math reasoning datasets, including both natural language reasoning and tool manipulation settings. The results show that JiuZhang3.0 achieves state-of-the-art performance, demonstrating the effectiveness of their approach.
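
The summary does not spell out the exact influence estimator, so the following is only a generic first-order sketch of gradient-based data selection, not the authors' implementation. It assumes a toy logistic-regression model in place of an LLM and scores each candidate by the cosine similarity between its loss gradient and the mean gradient of a small reference set. All function names and the random data are hypothetical.

```python
import math
import random


def logistic_grad(w, x, y):
    """Gradient of the logistic loss for one example (x, y) with respect to w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm = math.sqrt(sum(ai * ai for ai in a)) * math.sqrt(sum(bi * bi for bi in b))
    return dot / norm if norm > 0 else 0.0


def influence_scores(w, candidates, reference):
    """Score each candidate by how well its gradient aligns with the reference gradient.

    To first order, a gradient step on a well-aligned candidate lowers the
    loss on the reference set, so higher scores mark more useful candidates.
    """
    g_ref = mean_vector([logistic_grad(w, x, y) for x, y in reference])
    return [cosine(logistic_grad(w, x, y), g_ref) for x, y in candidates]


def select_top_k(scores, k):
    """Indices of the k highest-scoring candidates, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def make_examples(n, dim):
    """Random toy examples standing in for featurized texts (illustration only)."""
    return [([random.gauss(0, 1) for _ in range(dim)], random.randint(0, 1))
            for _ in range(n)]


random.seed(0)
dim = 8
w = [random.gauss(0, 1) for _ in range(dim)]   # toy model parameters
reference = make_examples(16, dim)             # stands in for target-task data
candidates = make_examples(100, dim)           # stands in for the math text pool

scores = influence_scores(w, candidates, reference)
top = select_top_k(scores, k=10)
print("selected candidate indices:", top)
```

With a real LLM, per-example gradients are very high-dimensional, so practical estimators typically rely on approximations; the ranking-and-selecting logic stays the same.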

Critical Analysis

The researchers have presented a novel and efficient approach to enhancing the mathematical reasoning capabilities of language models. However, there are a few potential limitations and areas for further research:

The reliance on GPT-4 for the initial problem synthesis may limit the scalability of the approach, as access to such a powerful model may not be readily available to all researchers and developers.
The paper does not provide a detailed analysis of the quality and diversity of the generated math problems, which could be an important factor in the effectiveness of the pre-training process.
The researchers could explore additional techniques, such as few-shot learning or weak supervision, to further improve the efficiency and performance of the math reasoning model.
It would be valuable to see the researchers apply their approach to other types of reasoning tasks, such as theorem proving or code understanding, to assess its broader applicability.

Overall, the researchers have presented a promising approach to enhancing mathematical reasoning in language models, and their work could have significant implications for a wide range of real-world applications.

Conclusion

This paper proposes an efficient and effective approach to training a language model with strong mathematical reasoning capabilities. By using GPT-4 to build a knowledge distillation dataset from carefully selected math-related texts, the researchers trained a small synthesis model whose 6 million generated problems were used to pre-train JiuZhang3.0, which achieves state-of-the-art performance on several math reasoning datasets.

This work demonstrates the potential of leveraging powerful language models like GPT-4 to enhance the mathematical reasoning abilities of more efficient models, reducing the cost and resource requirements for such capabilities. The researchers' approach could have significant implications for a wide range of applications that require advanced mathematical reasoning, from scientific and engineering fields to educational and financial services.
