Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

Read original: arXiv:2408.06195 - Published 8/13/2024 by Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, Mao Yang

Overview

The paper introduces rStar, a self-play mutual reasoning approach that improves the reasoning of small language models (SLMs) without fine-tuning and without help from a stronger model.
One SLM generates candidate reasoning paths using Monte Carlo Tree Search (MCTS) with a set of human-like reasoning actions, and a second SLM of similar capability checks each path; paths that both models agree on are treated as more likely to be correct.
Experiments across five SLMs show large accuracy gains on reasoning benchmarks, for example from 12.51% to 63.91% on GSM8K for LLaMA2-7B.

Plain English Explanation

The paper investigates a way to make smaller language models stronger at problem-solving. The key idea is mutual reasoning: one small model works through a problem step by step, for instance by posing and answering sub-questions, and a second small model checks that work.

Normally, larger language models perform better on complex reasoning tasks than smaller ones. This paper shows that small models can become much more accurate on such tasks without any additional training, simply by reasoning in this generate-and-check fashion.

The mutual reasoning approach works as follows:

The model is presented with a problem to solve.
It explores several candidate solution paths, using human-like reasoning steps such as breaking the problem into sub-questions and answering them.
A second small model of similar capability checks each candidate path.
The paths that both models agree on are treated as the most likely to be correct, and the final answer is chosen from them.
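
As a rough illustration of the question-and-answer style of reasoning described above, the Python sketch below breaks a toy word problem into sub-questions, answers each one, and combines the answers. The functions toy_ask, toy_answer, and toy_combine are hypothetical stubs standing in for language-model calls; nothing here is taken from the paper's implementation.

```python
from typing import Callable, Dict, List, Tuple

# A trace is the list of (sub-question, answer) pairs produced along the way.
Trace = List[Tuple[str, int]]


def solve_with_sub_questions(
    problem: str,
    ask: Callable[[str], List[str]],
    answer: Callable[[str, str], int],
    combine: Callable[[Trace], int],
) -> Tuple[Trace, int]:
    """Ask sub-questions, answer each one, then combine the answers."""
    trace: Trace = [(q, answer(problem, q)) for q in ask(problem)]
    return trace, combine(trace)


# --- Hypothetical stubs; a real system would call a language model here ---
PROBLEM = "Ann has 12 apples and gives away 5. How many are left?"

CANNED_ANSWERS: Dict[str, int] = {
    "How many apples does Ann start with?": 12,
    "How many apples does Ann give away?": 5,
}


def toy_ask(problem: str) -> List[str]:
    # Returns the canned sub-questions in insertion order.
    return list(CANNED_ANSWERS)


def toy_answer(problem: str, question: str) -> int:
    return CANNED_ANSWERS[question]


def toy_combine(trace: Trace) -> int:
    # For this toy problem the final answer is "start minus given away".
    start, given = (value for _, value in trace)
    return start - given


trace, final_answer = solve_with_sub_questions(
    PROBLEM, toy_ask, toy_answer, toy_combine
)
print(final_answer)
```

Passing the three steps in as functions keeps the control flow separate from whatever model produces the questions and answers, so the stubs could be swapped for real model calls without changing the loop.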

This back-and-forth between generating and checking lets a small model compensate for its more limited knowledge and capabilities. Because an answer is accepted only when two models independently arrive at it, a mistake made along a single reasoning path is more likely to be caught than when a model generates one solution directly.

Technical Explanation

The paper introduces rStar, a self-play mutual reasoning approach that improves the reasoning of small language models (SLMs) without fine-tuning or supervision from a superior model.

The method decouples reasoning into a generation-discrimination process:

Generation: A target SLM augments Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher-quality reasoning trajectories.
Discrimination: A second SLM, with capabilities similar to the target SLM, acts as a discriminator and verifies each trajectory the target SLM generates.
Selection: Trajectories that both models agree on are considered mutually consistent and are therefore more likely to be correct.

This mutual generation-discrimination process lets smaller models compensate for their more limited knowledge and capabilities. Because neither model is fine-tuned and neither is stronger than the other, agreement between them serves as the signal for which reasoning trajectories to trust.
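
The paper's abstract describes a second model that verifies each trajectory produced by the first, keeping those the two agree on. A minimal Python sketch of that filtering step follows. The agreement test used here (the second model sees only the opening steps and must reach the same final answer), the Trajectory type, and toy_discriminator are assumptions made for illustration, not details confirmed by this summary.

```python
from typing import Callable, List

# A trajectory is a list of reasoning steps; the last entry is the final answer.
Trajectory = List[str]


def mutually_consistent(
    candidates: List[Trajectory],
    complete: Callable[[Trajectory], str],
    keep_steps: int = 1,
) -> List[Trajectory]:
    """Keep trajectories whose final answer the second model reproduces.

    The second model is shown only the first `keep_steps` steps of each
    candidate and must finish the reasoning on its own (assumed protocol).
    """
    agreed: List[Trajectory] = []
    for trajectory in candidates:
        prefix = trajectory[:keep_steps]
        if complete(prefix) == trajectory[-1]:
            agreed.append(trajectory)
    return agreed


# Hypothetical candidates from the generator for "12 - 5 = ?"
candidates: List[Trajectory] = [
    ["12 - 5", "7"],   # correct setup, correct answer
    ["12 + 5", "17"],  # wrong setup
    ["12 - 5", "8"],   # correct setup, arithmetic slip
]


def toy_discriminator(prefix: Trajectory) -> str:
    """Stub second model: only continues from setups it recognises."""
    known = {"12 - 5": "7"}
    return known.get(prefix[0], "unsure")


agreed = mutually_consistent(candidates, toy_discriminator)
print(agreed)
```

In this toy run only the first candidate survives: the second is rejected because the stub does not recognise its setup, and the third because the stub's completion disagrees with its final answer.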

The authors evaluate rStar across five SLMs on diverse reasoning benchmarks, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. On GSM8K, accuracy rises from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, and from 74.53% to 91.13% for LLaMA3-8B-Instruct.
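
To make the size of the reported improvements concrete, the short Python snippet below computes the absolute GSM8K gains from the before and after accuracies quoted in the paper's abstract. Only the variable names are invented here; the figures are the reported ones.

```python
# GSM8K accuracy (%) without and with rStar, as quoted in the paper's abstract.
reported = {
    "LLaMA2-7B": (12.51, 63.91),
    "Mistral-7B": (36.46, 81.88),
    "LLaMA3-8B-Instruct": (74.53, 91.13),
}

# Absolute gain in percentage points, rounded to match the reported precision.
gains = {
    model: round(after - before, 2)
    for model, (before, after) in reported.items()
}

for model, gain in gains.items():
    print(f"{model}: +{gain:.2f} points")
```

The gains come to 51.40, 45.42, and 16.60 percentage points respectively, so the largest improvement is for the model with the lowest starting accuracy.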

Critical Analysis

The paper presents a compelling approach to enhancing the problem-solving abilities of smaller language models. The key strength is the mutual reasoning mechanism, which needs neither fine-tuning nor a stronger teacher model: one small model generates reasoning trajectories and a peer of similar capability verifies them.

However, several limitations and open questions remain:

Task Generalization: While the mutual reasoning approach showed strong results on the evaluated tasks, it's unclear how well it would generalize to a wider range of problem types or domains.
Computational Efficiency: The additional tree search and verification steps increase inference cost compared to generating a single solution directly, which could be a concern for real-world deployment.
Scaling to Larger Models: The paper focuses on enhancing smaller models, but it's unclear if the mutual reasoning approach would provide similar benefits for larger, more capable models.

Additionally, future research could explore:

Integrating External Knowledge: Allowing the models to access relevant external information during the mutual reasoning process may further boost their problem-solving abilities.
Explainability and Transparency: Investigating how the mutual reasoning process can be made more interpretable and transparent, rather than treating it as a "black box" solution.

Overall, the mutual reasoning approach is a promising direction for improving the problem-solving skills of smaller language models, but additional research is needed to fully understand its limitations and potential real-world applications.

Conclusion

This paper presents rStar, a mutual reasoning approach that substantially improves the accuracy of small language models on complex reasoning tasks. By having one model generate reasoning trajectories and a second, similarly capable model verify them, the authors demonstrate large performance gains without fine-tuning either model.

This work highlights the potential for enhancing the capabilities of smaller, more efficient models through sophisticated reasoning strategies, rather than simply relying on scale. As language models continue to play an increasingly important role in AI systems, techniques like mutual reasoning could be crucial for developing models that are both powerful and resource-efficient.

While the results are promising, further research is needed to fully understand the limitations and broader applicability of this approach. Nonetheless, this paper represents an important step forward in the ongoing quest to build smarter, more capable AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, Mao Yang

This paper introduces rStar, a self-play mutual reasoning approach that significantly improves reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process. First, a target SLM augments the Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher quality reasoning trajectories. Next, another SLM, with capabilities similar to the target SLM, acts as a discriminator to verify each trajectory generated by the target SLM. The mutually agreed reasoning trajectories are considered mutual consistent, thus are more likely to be correct. Extensive experiments across five SLMs demonstrate rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, from 74.53% to 91.13% for LLaMA3-8B-Instruct. Code will be available at https://github.com/zhentingqi/rStar.

8/13/2024

MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time

Jikun Kang, Xin Zhe Li, Xi Chen, Amirreza Kazemi, Qianyi Sun, Boxing Chen, Dong Li, Xu He, Quan He, Feng Wen, Jianye Hao, Jun Yao

Although Large Language Models (LLMs) achieve remarkable performance across various tasks, they often struggle with complex reasoning tasks, such as answering mathematical questions. Recent efforts to address this issue have primarily focused on leveraging mathematical datasets through supervised fine-tuning or self-improvement techniques. However, these methods often depend on high-quality datasets that are difficult to prepare, or they require substantial computational resources for fine-tuning. Inspired by findings that LLMs know how to produce the right answer but struggle to select the correct reasoning path, we propose a purely inference-based searching method -- MindStar (M*). This method formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. We evaluate the M* framework on both the GSM8K and MATH datasets, comparing its performance with existing open and closed-source LLMs. Our results demonstrate that M* significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1, but with substantially reduced model size and computational costs.

6/27/2024

V-STaR: Training Verifiers for Self-Taught Reasoners

Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal

Common self-improvement approaches for large language models (LLMs), such as STaR, iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR that utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.

8/15/2024

Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang

Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs), where LLMs refine their solutions using self-generated critiques that pinpoint the errors. This work explores whether small (<= 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs. We propose a novel pipeline that prompts smaller LMs to collect self-correction data that supports the training of self-refinement abilities. First, we leverage correct solutions to guide the model in critiquing their incorrect responses. Second, the generated critiques, after filtering, are used for supervised fine-tuning of the self-correcting reasoner through solution refinement. Our experimental results show improved self-correction abilities of two models on five datasets spanning math and commonsense reasoning, with notable performance gains when paired with a strong GPT-4-based verifier, though limitations are identified when using a weak self-verifier for determining when to correct.

6/7/2024