Self-Polish: Enhance Reasoning in Large Language Models via Problem Refinement

arXiv:2305.14497

Published 4/19/2024 by Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Tao Gui, Qi Zhang, Xuanjing Huang

Abstract

To enhance the multi-step reasoning capabilities of large language models, researchers have extensively explored prompting methods, notably the Chain-of-Thought (CoT) method which explicitly elicits human-like rationales. However, they have inadvertently overlooked the potential of enhancing model reasoning performance by formulating higher-quality problems. In this work, we start from the problem side and propose Self-Polish (SP), a novel method that facilitates the model's reasoning by guiding it to progressively refine the given problems to be more comprehensible and solvable. We also explore several automatic prompting variants and propose the Self-Polish prompt bank for the community. SP is orthogonal to all other prompting methods of answer/reasoning side like CoT, allowing for seamless integration with state-of-the-art techniques for further improvement. Thorough experiments show that the proposed method attains notable and consistent effectiveness on five reasoning benchmarks across different models. Furthermore, our method also showcases impressive performance on robustness evaluation. Codes and prompts are available at https://github.com/WooooDyy/Self-Polish.

Overview

Researchers have extensively explored prompting methods, such as the Chain-of-Thought (CoT) technique, to enhance the multi-step reasoning capabilities of large language models.
However, the potential of enhancing model reasoning performance by formulating higher-quality problems has been overlooked.
This work proposes a novel method called Self-Polish (SP) that facilitates the model's reasoning by guiding it to progressively refine the given problems to be more comprehensible and solvable.
The researchers also explore several automatic prompting variants and propose the Self-Polish prompt bank for the community.

Plain English Explanation

Large language models, such as GPT-3 and other Transformer-based models, have made impressive progress in understanding and generating human-like text. However, they still struggle with complex, multi-step reasoning tasks.

To address this, researchers have tried various techniques, like the Chain-of-Thought (CoT) method, which explicitly prompts the model to provide step-by-step explanations for its answers. While this has helped, the researchers in this study noticed another potential approach: improving the quality of the problems themselves.

Their proposed method, called Self-Polish (SP), guides the language model to refine the given problems, making them more understandable and easier to solve. This is done by prompting the model to progressively rephrase and clarify the problem statement. The researchers also created a library of these self-polishing prompts for others to use.

Importantly, Self-Polish is complementary to other prompting techniques, like CoT, and can be used together with them to further enhance the model's reasoning abilities. The researchers found that this approach leads to consistent improvements across several different reasoning benchmarks and models.

Technical Explanation

The researchers start by noting that while prompting methods like Chain-of-Thought (CoT) have been extensively explored to improve the multi-step reasoning capabilities of large language models, the potential of enhancing model reasoning performance by formulating higher-quality problems has been overlooked.

To address this, they propose a novel method called Self-Polish (SP) that facilitates the model's reasoning by guiding it to progressively refine the given problems to be more comprehensible and solvable. The key idea is to prompt the model to rephrase and clarify the problem statement, rather than just focusing on eliciting step-by-step explanations for the final answer.
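The progressive refinement described above can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: the names `self_polish`, `rewrite`, `solve`, and `max_rounds` are hypothetical, the two callables stand in for language model calls, and the stopping rule (stop once the answer no longer changes between rounds, or after a fixed number of rounds) is an assumption that this summary does not confirm. The toy stand-ins at the bottom exist only so the sketch runs without a model.

```python
import re
from typing import Callable, Tuple


def self_polish(problem: str,
                rewrite: Callable[[str], str],
                solve: Callable[[str], str],
                max_rounds: int = 3) -> Tuple[str, str]:
    """Refine `problem` round by round; return (final_problem, final_answer).

    Assumption: refinement stops when two consecutive rounds give the same
    answer, or after `max_rounds` rewrites.
    """
    answer = solve(problem)
    for _ in range(max_rounds):
        problem = rewrite(problem)      # ask the model for a clearer version
        new_answer = solve(problem)     # answer the refined problem
        if new_answer == answer:        # answer is stable: stop refining
            break
        answer = new_answer
    return problem, answer


def toy_rewrite(problem: str) -> str:
    """Hypothetical stand-in for an LLM rewrite: drop a distracting sentence."""
    sentences = [s for s in problem.split(". ") if s]
    return ". ".join(s for s in sentences if "weather" not in s)


def toy_solve(problem: str) -> str:
    """Hypothetical stand-in for an LLM solver: naively add every number."""
    return str(sum(int(n) for n in re.findall(r"\d+", problem)))


if __name__ == "__main__":
    raw = ("Tom has 3 apples. The weather is 20 degrees. "
           "He buys 2 more apples. How many apples does he have?")
    polished, answer = self_polish(raw, toy_rewrite, toy_solve)
    print(polished)
    print(answer)
```

In this toy run the naive solver is misled by the irrelevant temperature in the raw problem and recovers once the rewrite removes it, which is the kind of effect the method aims for. In practice both callables would wrap prompts sent to the same model.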

The researchers also explore several automatic prompting variants and propose the Self-Polish prompt bank for the community. Importantly, SP is orthogonal to all other prompting methods focused on the answer/reasoning side, such as CoT, allowing for seamless integration with state-of-the-art techniques for further improvement.
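Because the refinement acts on the problem and CoT acts on the answer, the two can be chained. The sketch below shows one hypothetical way to do so: polish the problem first, then wrap the result in a zero-shot CoT prompt. The function names, the `Q:`/`A:` layout, and the use of the "Let's think step by step." trigger are illustrative assumptions, not the prompts from the paper's prompt bank.

```python
from typing import Callable

# Widely used zero-shot CoT trigger phrase; its use here is an assumption.
COT_TRIGGER = "Let's think step by step."


def build_cot_prompt(problem: str) -> str:
    """Wrap a (possibly refined) problem in a simple zero-shot CoT prompt."""
    return f"Q: {problem.strip()}\nA: {COT_TRIGGER}"


def polish_then_cot(problem: str,
                    rewrite: Callable[[str], str],
                    complete: Callable[[str], str],
                    rounds: int = 1) -> str:
    """Refine the problem `rounds` times, then answer it with a CoT prompt.

    `rewrite` and `complete` are hypothetical stand-ins for model calls.
    """
    for _ in range(rounds):
        problem = rewrite(problem)
    return complete(build_cot_prompt(problem))


if __name__ == "__main__":
    # Toy stand-ins: collapse extra spaces, and echo the prompt back.
    tidy = lambda p: " ".join(p.split())
    echo = lambda prompt: prompt
    print(polish_then_cot("How  many   apples are left?", tidy, echo))
```

The refinement step never needs to know which answer-side method follows it, so `build_cot_prompt` could be swapped for a few-shot or other answer-side prompt without touching the rewrite stage.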

Thorough experiments on five reasoning benchmarks across different models show that the proposed Self-Polish method attains notable and consistent effectiveness. Furthermore, the method also showcases impressive performance on robustness evaluation, demonstrating its ability to handle challenging problem formulations.

Critical Analysis

The researchers acknowledge that while their Self-Polish method shows promising results, there are still some limitations and areas for further exploration. For example, the paper does not delve into the specific mechanisms by which the model refines the problem statements, nor does it provide a deep analysis of the types of problem reformulations that lead to the greatest improvements.

Additionally, the researchers mention that their method relies on the availability of high-quality seed prompts, and the performance may be sensitive to the quality of these prompts. Exploring more robust and automated ways of generating these prompts could be an interesting direction for future research.

Another potential area for improvement is to investigate how the Self-Polish method could be applied to more open-ended problem-solving tasks, beyond the structured reasoning benchmarks used in this study. Adapting the approach to handle ill-defined problems or real-world scenarios may require additional considerations.

Despite these limitations, the Self-Polish method represents an important step forward in enhancing the reasoning capabilities of large language models. By shifting the focus to the problem formulation, rather than solely the answer generation, the researchers have uncovered a promising avenue for further advancements in this field.

Conclusion

This work proposes a novel method called Self-Polish (SP) that aims to improve the multi-step reasoning capabilities of large language models by guiding them to progressively refine the given problems to be more comprehensible and solvable.

The key insight is that in addition to exploring prompting techniques that focus on the answer/reasoning side, such as Chain-of-Thought (CoT), there is significant potential in enhancing model performance by formulating higher-quality problems.

The researchers have demonstrated the effectiveness of their Self-Polish method across various reasoning benchmarks and models, and have also provided the community with a set of self-polishing prompts to build upon. This work represents an important step towards more robust and capable language models that can tackle complex, multi-step reasoning tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Pattern-Aware Chain-of-Thought Prompting in Large Language Models

Yufeng Zhang, Xuepeng Wang, Lingxiang Wu, Jinqiao Wang

Chain-of-thought (CoT) prompting can guide language models to engage in complex multi-step reasoning. The quality of provided demonstrations significantly impacts the success of downstream inference tasks. While existing automated methods prioritize accuracy and semantics in these demonstrations, we show that the underlying reasoning patterns play a more crucial role in such tasks. In this paper, we propose Pattern-Aware CoT, a prompting method that considers the diversity of demonstration patterns. By incorporating patterns such as step length and reasoning process within intermediate steps, PA-CoT effectively mitigates the issue of bias induced by demonstrations and enables better generalization to diverse scenarios. We conduct experiments on nine reasoning benchmark tasks using two open-source LLMs. The results show that our method substantially enhances reasoning performance and exhibits robustness to errors. The code will be made publicly available.

4/24/2024

cs.CL

Active Prompting with Chain-of-Thought for Large Language Models

Shizhe Diao, Pengcheng Wang, Yong Lin, Tong Zhang

The increasing scale of large language models (LLMs) brings emergent abilities to various complex tasks requiring reasoning, such as arithmetic and commonsense reasoning. It is known that the effective design of task-specific prompts is critical for LLMs' ability to produce high-quality answers. In particular, an effective approach for complex question-and-answer tasks is example-based prompting with chain-of-thought (CoT) reasoning, which significantly improves the performance of LLMs. However, current CoT methods rely on a fixed set of human-annotated exemplars, which are not necessarily the most effective examples for different tasks. This paper proposes a new method, Active-Prompt, to adapt LLMs to different tasks with task-specific example prompts (annotated with human-designed CoT reasoning). For this purpose, we propose a solution to the key problem of determining which questions are the most important and helpful ones to annotate from a pool of task-specific queries. By borrowing ideas from the related problem of uncertainty-based active learning, we introduce several metrics to characterize the uncertainty so as to select the most uncertain questions for annotation. Experimental results demonstrate the superiority of our proposed method, achieving state-of-the-art on eight complex reasoning tasks. Further analyses of different uncertainty metrics, pool sizes, zero-shot learning, and accuracy-uncertainty relationship demonstrate the effectiveness of our method. Our code will be available at https://github.com/shizhediao/active-prompt.

6/10/2024

cs.CL

Just Ask One More Time! Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios

Lei Lin, Jiayi Fu, Pengli Liu, Qingyang Li, Yan Gong, Junchen Wan, Fuzheng Zhang, Zhongyuan Wang, Di Zhang, Kun Gai

Although chain-of-thought (CoT) prompting combined with language models has achieved encouraging results on complex reasoning tasks, the naive greedy decoding used in CoT prompting usually causes repetitiveness and local optimality. To address this shortcoming, ensemble-optimization tries to obtain multiple reasoning paths to assemble the final answer. However, current ensemble-optimization methods either simply employ rule-based post-processing such as self-consistency, or train an additional model based on several task-related human annotations to select the best one among multiple reasoning paths, yet fail to generalize to realistic settings where the type of input questions is unknown or the answer format of reasoning paths is unknown. To avoid their limitations, we propose Self-Agreement, a generalizable ensemble-optimization method applicable in almost all scenarios where the type of input questions and the answer format of reasoning paths may be known or unknown. Self-agreement first samples from the language model's decoder to generate a diverse set of reasoning paths, and subsequently prompts the language model one more time to determine the optimal answer by selecting the most agreed answer among the sampled reasoning paths. Self-agreement simultaneously achieves remarkable performance on six public reasoning benchmarks and superior generalization capabilities.

5/27/2024

cs.CL cs.AI

Boosting Language Models Reasoning with Chain-of-Knowledge Prompting

Jianing Wang, Qiushi Sun, Xiang Li, Ming Gao

Recently, Chain-of-Thought (CoT) prompting has delivered success on complex reasoning tasks, which aims at designing a simple prompt like "Let's think step by step" or multiple in-context exemplars with well-designed rationales to elicit Large Language Models (LLMs) to generate intermediate reasoning steps. However, the generated rationales often come with mistakes, making unfactual and unfaithful reasoning chains. To mitigate this brittleness, we propose a novel Chain-of-Knowledge (CoK) prompting, where we aim at eliciting LLMs to generate explicit pieces of knowledge evidence in the form of structure triple. This is inspired by our human behaviors, i.e., we can draw a mind map or knowledge map as the reasoning evidence in the brain before answering a complex question. Benefiting from CoK, we additionally introduce an F²-Verification method to estimate the reliability of the reasoning chains in terms of factuality and faithfulness. For the unreliable response, the wrong evidence can be indicated to prompt the LLM to rethink. Extensive experiments demonstrate that our method can further improve the performance of commonsense, factual, symbolic, and arithmetic reasoning tasks.

6/4/2024

cs.CL