InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

2402.06332

Published 5/27/2024 by Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang and 12 others

cs.CL

💬

Abstract

The math abilities of large language models can represent their abstract reasoning ability. In this paper, we introduce and open-source our math reasoning LLMs InternLM-Math which is continue pre-trained from InternLM2. We unify chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and code interpreter in a unified seq2seq format and supervise our model to be a versatile math reasoner, verifier, prover, and augmenter. These abilities can be used to develop the next math LLMs or self-iteration. InternLM-Math obtains open-sourced state-of-the-art performance under the setting of in-context learning, supervised fine-tuning, and code-assisted reasoning in various informal and formal benchmarks including GSM8K, MATH, Hungary math exam, MathBench-ZH, and MiniF2F. Our pre-trained model achieves 30.3 on the MiniF2F test set without fine-tuning. We further explore how to use LEAN to solve math problems and study its performance under the setting of multi-task learning which shows the possibility of using LEAN as a unified platform for solving and proving in math. Our models, codes, and data are released at url{https://github.com/InternLM/InternLM-Math}.

Create account to get full access

Overview

This paper introduces InternLM-Math, a large language model (LLM) trained for advanced mathematical reasoning.
InternLM-Math is built by continuing pre-training on the InternLM2 model, and unifying various math-related capabilities like chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and code interpretation.
The model achieves state-of-the-art performance on various math benchmarks, including GSM8K, MATH, the Hungary math exam, MathBench-ZH, and MiniF2F.
The paper also explores using the LEAN theorem prover to solve math problems in a multi-task learning setting.

Plain English Explanation

The researchers behind this paper wanted to see how well large language models (LLMs) could handle advanced mathematical reasoning tasks. To do this, they created a new model called InternLM-Math, which was built by taking an existing LLM (called InternLM2) and continuing to train it on a variety of math-related capabilities.

These capabilities include things like:

Chain-of-thought reasoning: The ability to break down a complex math problem into a step-by-step thought process.
Reward modeling: Learning to identify when a solution is correct or incorrect.
Formal reasoning: The ability to work with mathematical proofs and formal logic.
Data augmentation: Generating new training data to help the model learn better.
Code interpretation: Understanding and working with computer code related to math problems.

By combining all of these abilities, the researchers were able to create a very versatile math-focused LLM. When they tested InternLM-Math on a range of math benchmarks, they found that it performed better than other state-of-the-art models.

The paper also explores using a tool called LEAN, which is a theorem prover, to help the model solve math problems. Theorem provers are computer programs that can automatically prove mathematical statements are true or false. The researchers found that using LEAN in a multi-task learning setting showed promise for further improving the model's math skills.

Overall, this research represents an important step forward in developing LLMs that can handle advanced mathematical reasoning, which could have valuable applications in fields like science, engineering, and education.

Technical Explanation

The key technical elements of this paper are:

Model Architecture: The researchers started with the InternLM2 model, which is a large language model trained on a broad corpus of text. They then continued pre-training this model on a variety of math-related tasks and datasets, including chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and code interpretation.
Benchmark Evaluation: The researchers evaluated InternLM-Math on several math-focused benchmarks, including GSM8K, MATH, the Hungary math exam, MathBench-ZH, and MiniF2F. The model achieved state-of-the-art performance on these benchmarks, even without fine-tuning.
LEAN Integration: The researchers explored using the LEAN theorem prover to assist InternLM-Math in solving math problems. They found that incorporating LEAN into a multi-task learning setting showed promise for further improving the model's mathematical reasoning capabilities.
Open-sourcing: The researchers have made the InternLM-Math model, code, and data publicly available at https://github.com/InternLM/InternLM-Math. This allows other researchers and developers to build upon their work.

Overall, this paper demonstrates the potential for large language models to excel at advanced mathematical reasoning tasks when trained on the right combination of math-focused capabilities and datasets.

Critical Analysis

The researchers in this paper have made a significant contribution to the field of mathematical reasoning in large language models. However, there are a few potential limitations and areas for further research that could be considered:

Generalization: While InternLM-Math performed well on the benchmarks tested, it's unclear how well the model would generalize to completely novel math problems or domains. Further evaluation on a broader range of mathematical tasks would help assess the model's true capabilities.
Interpretability: Large language models like InternLM-Math can be difficult to interpret, making it challenging to understand the reasoning behind their solutions. Developing more interpretable approaches to mathematical reasoning in LLMs could be an important area for future research.
Formal Verification: The integration of the LEAN theorem prover is a promising step towards formal verification of the model's reasoning. However, more work may be needed to fully ensure the correctness and reliability of InternLM-Math's mathematical proofs.
Efficiency: The computational resources required to train and run large language models like InternLM-Math can be substantial. Exploring more efficient model architectures or training techniques could help make these models more practical for real-world applications.

Overall, this paper represents an exciting advancement in the field of mathematical reasoning in large language models. By continuing to push the boundaries of what these models can achieve, researchers may unlock new possibilities for AI-powered tools that can assist and enhance human mathematical abilities.

Conclusion

This paper introduces InternLM-Math, a large language model that has been trained to excel at a variety of mathematical reasoning tasks. By unifying capabilities like chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and code interpretation, the researchers were able to create a versatile math-focused model that outperforms other state-of-the-art approaches.

The model's strong performance on benchmarks like GSM8K, MATH, the Hungary math exam, MathBench-ZH, and MiniF2F suggests that large language models can be powerful tools for advanced mathematical reasoning. The researchers' exploration of using the LEAN theorem prover also shows promise for further enhancing these models' mathematical capabilities.

Overall, this work represents an important step forward in the development of AI systems that can assist and enhance human mathematical abilities across a wide range of applications, from scientific research to educational support. By continuing to push the boundaries of what large language models can achieve, researchers may unlock new possibilities for AI-powered tools that can transform the way we approach and solve complex mathematical problems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Caught in the Quicksand of Reasoning, Far from AGI Summit: Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

Pengfei Hong, Navonil Majumder, Deepanway Ghosal, Somak Aditya, Rada Mihalcea, Soujanya Poria

Recent advancements in Large Language Models (LLMs) have showcased striking results on existing logical reasoning benchmarks, with some models even surpassing human performance. However, the true depth of their competencies and robustness in reasoning tasks remains an open question. To this end, in this paper, we focus on two popular reasoning tasks: arithmetic reasoning and code generation. Particularly, we introduce: (i) a general ontology of perturbations for maths and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets, MORE and CORE, respectively, of perturbed maths and coding problems to probe the limits of LLM capabilities in numeric reasoning and coding tasks. Through comprehensive evaluations of both closed-source and open-source LLMs, we show a significant performance drop across all the models against the perturbed questions, suggesting that the current LLMs lack robust problem solving skills and structured reasoning abilities in many areas, as defined by our ontology. We open source the datasets and source codes at: https://github.com/declare-lab/llm_robustness.

6/28/2024

cs.CL

MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time

Jikun Kang, Xin Zhe Li, Xi Chen, Amirreza Kazemi, Qianyi Sun, Boxing Chen, Dong Li, Xu He, Quan He, Feng Wen, Jianye Hao, Jun Yao

Although Large Language Models (LLMs) achieve remarkable performance across various tasks, they often struggle with complex reasoning tasks, such as answering mathematical questions. Recent efforts to address this issue have primarily focused on leveraging mathematical datasets through supervised fine-tuning or self-improvement techniques. However, these methods often depend on high-quality datasets that are difficult to prepare, or they require substantial computational resources for fine-tuning. Inspired by findings that LLMs know how to produce the right answer but struggle to select the correct reasoning path, we propose a purely inference-based searching method -- MindStar (M*). This method formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. We evaluate the M* framework on both the GSM8K and MATH datasets, comparing its performance with existing open and closed-source LLMs. Our results demonstrate that M* significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1, but with substantially reduced model size and computational costs.

6/27/2024

cs.LG

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu

Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (e.g., LLaMA-2) are still far away from satisfactory for solving mathematical problem due to the complex reasoning procedures. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Particularly, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release all the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use.

5/6/2024

cs.CL cs.AI

Large Language Models for Mathematical Reasoning: Progresses and Challenges

Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, Wenpeng Yin

Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive capabilities of human intelligence. In recent times, there has been a notable surge in the development of Large Language Models (LLMs) geared towards the automated resolution of mathematical problems. However, the landscape of mathematical problem types is vast and varied, with LLM-oriented techniques undergoing evaluation across diverse datasets and settings. This diversity makes it challenging to discern the true advancements and obstacles within this burgeoning field. This survey endeavors to address four pivotal dimensions: i) a comprehensive exploration of the various mathematical problems and their corresponding datasets that have been investigated; ii) an examination of the spectrum of LLM-oriented techniques that have been proposed for mathematical problem-solving; iii) an overview of factors and concerns affecting LLMs in solving math; and iv) an elucidation of the persisting challenges within this domain. To the best of our knowledge, this survey stands as one of the first extensive examinations of the landscape of LLMs in the realm of mathematics, providing a holistic perspective on the current state, accomplishments, and future challenges in this rapidly evolving field.

4/8/2024

cs.CL