DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

2402.03300

Published 4/30/2024 by Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu and 1 other

cs.CL cs.AI cs.LG

💬

Abstract

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.

Create account to get full access

Overview

The paper introduces DeepSeekMath 7B, a large language model trained on a vast amount of math-related data to improve its mathematical reasoning capabilities.
DeepSeekMath 7B achieves impressive performance on the competition-level MATH benchmark, approaching the level of state-of-the-art models like Gemini-Ultra and GPT-4.
The paper attributes the model's mathematical reasoning abilities to two key factors: leveraging publicly available web data and introducing a novel optimization technique called Group Relative Policy Optimization (GRPO).

Plain English Explanation

The paper presents a new large language model called DeepSeekMath 7B that is specifically designed to excel at mathematical reasoning. Mathematical reasoning is a significant challenge for language models due to the complex and structured nature of mathematics.

To address this challenge, the researchers behind DeepSeekMath 7B took two key steps. First, they gathered a massive amount of math-related data from the web, including 120B math-related tokens from Common Crawl. This allowed the model to learn a deep understanding of mathematical concepts and problem-solving strategies.

Second, the researchers introduced a new optimization technique called Group Relative Policy Optimization (GRPO), which is a variant of the well-known Proximal Policy Optimization (PPO) algorithm. GRPO helps the model develop stronger mathematical reasoning abilities while also improving its memory usage, making it more efficient.

The results are impressive: DeepSeekMath 7B achieves a score of 51.7% on the challenging MATH benchmark, approaching the performance of cutting-edge models like Gemini-Ultra and GPT-4. When the model's self-consistency is taken into account, the score rises to 60.9%, further demonstrating its mathematical prowess.

This research represents a significant step forward in the field of large language models for mathematical reasoning, and it has the potential to impact various domains that rely on advanced mathematical skills, such as scientific research, engineering, and education.

Technical Explanation

The paper introduces DeepSeekMath 7B, a large language model that has been pre-trained on a massive amount of math-related data from Common Crawl, totaling 120 billion tokens. This data, combined with natural language and code data, is used to continue the pre-training of the DeepSeek-Coder-Base-v1.5 7B model.

The key innovation in this work is the use of a novel optimization technique called Group Relative Policy Optimization (GRPO), which is a variant of the Proximal Policy Optimization (PPO) algorithm. GRPO is designed to enhance the model's mathematical reasoning abilities while also improving its memory usage, making it more efficient.

The researchers evaluate the performance of DeepSeekMath 7B on the competition-level MATH benchmark, and the model achieves an impressive score of 51.7% without relying on external toolkits or voting techniques. This performance level approaches that of state-of-the-art models like Gemini-Ultra and GPT-4.

Furthermore, the researchers demonstrate that leveraging the self-consistency of the model's outputs over 64 samples can further improve the performance, reaching a score of 60.9% on the MATH benchmark.

The paper attributes the strong mathematical reasoning capabilities of DeepSeekMath 7B to two key factors: the extensive math-related data used for pre-training and the introduction of the GRPO optimization technique.

Critical Analysis

The paper presents a compelling approach to improving the mathematical reasoning capabilities of large language models, and the results achieved by DeepSeekMath 7B are impressive. However, there are a few potential limitations and areas for further research that could be considered.

First, the paper does not provide a detailed analysis of the types of mathematical problems or concepts that DeepSeekMath 7B excels or struggles with. A more granular evaluation of the model's strengths and weaknesses could help identify areas for future improvements.

Additionally, the paper does not address the potential generalization of the GRPO technique to other types of reasoning tasks beyond mathematics. It would be interesting to explore the broader applicability of this optimization method and its impact on other domains.

Furthermore, the paper does not discuss the computational and resource requirements of training DeepSeekMath 7B, which could be a critical factor in the model's real-world deployability and scalability. Insights into the trade-offs between performance and efficiency would be valuable for the research community.

Despite these potential areas for further exploration, the overall approach and the results presented in the paper represent a significant step forward in the field of large language models for mathematical reasoning. The research has the potential to inspire future work and contribute to the development of more capable and accessible mathematical AI systems.

Conclusion

The paper introduces DeepSeekMath 7B, a large language model that has been specifically designed and trained to excel at mathematical reasoning. By leveraging a vast amount of math-related web data and introducing a novel optimization technique called Group Relative Policy Optimization (GRPO), the researchers have achieved impressive results on the challenging MATH benchmark.

DeepSeekMath 7B's performance, which approaches that of state-of-the-art models like Gemini-Ultra and GPT-4, demonstrates the significant potential of this approach and its broader implications for fields that rely on advanced mathematical skills. The research represents an important step forward in the ongoing efforts to develop large language models that can effectively tackle complex mathematical problems and reasoning tasks.

As the field of large language models for mathematical reasoning continues to evolve, the insights and techniques presented in this paper are likely to inspire further advancements and contribute to the development of even more capable and versatile mathematical AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

📊

DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, Xiaodan Liang

Proof assistants like Lean have revolutionized mathematical proof verification, ensuring high accuracy and reliability. Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem proving is hindered by a lack of training data. To address this issue, we introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems. This approach involves translating natural language problems into formal statements, filtering out low-quality statements, and generating proofs to create synthetic data. After fine-tuning the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs, our model achieved whole-proof generation accuracies of 46.3% with 64 samples and 52% cumulatively on the Lean 4 miniF2F test, surpassing the baseline GPT-4 at 23.0% with 64 samples and a tree search reinforcement learning method at 41.0%. Additionally, our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any. These results demonstrate the potential of leveraging large-scale synthetic data to enhance theorem-proving capabilities in LLMs. Both the synthetic dataset and the model will be made available to facilitate further research in this promising field.

5/24/2024

cs.AI

💬

InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, Yudong Wang, Zijian Wu, Shuaibin Li, Fengzhe Zhou, Hongwei Liu, Songyang Zhang, Wenwei Zhang, Hang Yan, Xipeng Qiu, Jiayu Wang, Kai Chen, Dahua Lin

The math abilities of large language models can represent their abstract reasoning ability. In this paper, we introduce and open-source our math reasoning LLMs InternLM-Math which is continue pre-trained from InternLM2. We unify chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and code interpreter in a unified seq2seq format and supervise our model to be a versatile math reasoner, verifier, prover, and augmenter. These abilities can be used to develop the next math LLMs or self-iteration. InternLM-Math obtains open-sourced state-of-the-art performance under the setting of in-context learning, supervised fine-tuning, and code-assisted reasoning in various informal and formal benchmarks including GSM8K, MATH, Hungary math exam, MathBench-ZH, and MiniF2F. Our pre-trained model achieves 30.3 on the MiniF2F test set without fine-tuning. We further explore how to use LEAN to solve math problems and study its performance under the setting of multi-task learning which shows the possibility of using LEAN as a unified platform for solving and proving in math. Our models, codes, and data are released at url{https://github.com/InternLM/InternLM-Math}.

5/27/2024

cs.CL

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, Wenfeng Liang

We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.

6/19/2024

cs.SE cs.AI cs.LG

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu

Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (e.g., LLaMA-2) are still far away from satisfactory for solving mathematical problem due to the complex reasoning procedures. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Particularly, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release all the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use.

5/6/2024

cs.CL cs.AI