ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search






Published 6/7/2024 by Dan Zhang, Sining Zhoubian, Yisong Yue, Yuxiao Dong, Jie Tang
Recent methodologies in LLM self-training mostly rely on LLM generating responses and filtering those with correct output answers as training data. This approach often yields a low-quality fine-tuning training set (e.g., incorrect plans or intermediate reasoning). In this paper, we develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models. ReST-MCTS* circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: Given oracle final correct answers, ReST-MCTS* is able to infer the correct process rewards by estimating the probability this step can help lead to the correct answer. These inferred rewards serve dual purposes: they act as value targets for further refining the process reward model and also facilitate the selection of high-quality traces for policy model self-training. We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget. We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance the three language models for multiple iterations, and outperform other self-training algorithms such as ReST$^\text{EM}$ and Self-Rewarding LM.

  • This paper presents ReST-MCTS*, a novel approach that combines Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to enable self-training and improve reasoning capabilities.
  • The key idea is to use MCTS to explore and evaluate different reasoning paths, guided by a reward function that captures the "process quality" of the reasoning, rather than just the final outcome.
  • This process-focused reward lets the LLM improve its reasoning through self-training without manual per-step annotation: only the correct final answers are needed, and the per-step rewards are inferred from the tree search.

Plain English Explanation

The researchers developed a system called ReST-MCTS* that combines the power of large language models (LLMs) with a technique called Monte Carlo Tree Search (MCTS). LLMs are AI systems that can generate human-like text, while MCTS is a way of exploring different options and evaluating them to find the best course of action.

In this case, the researchers used MCTS to explore different reasoning paths that the LLM could take. Instead of just looking at the final result, the system evaluated the "quality" of each step in the reasoning process, using a special reward function. That reward is not labeled by hand. Given the known correct final answer, the system estimates how likely each step is to lead to it. This allowed the LLM to learn and improve its reasoning skills on its own, without anyone annotating the individual steps.

The key insight is that by focusing on the reasoning process, rather than just the final answer, the LLM can learn to reason more effectively and solve complex problems better over time. This self-training approach could be a powerful way to help large language models become more capable and reliable, without the need to manually label every reasoning step.

Technical Explanation

The researchers present ReST-MCTS*, a system that integrates Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) to enable self-training and improved reasoning capabilities.

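The core idea is to use tree search to explore and evaluate different reasoning paths that the LLM can take, guided by a "process reward" that scores the quality of each intermediate step rather than only the final outcome. Per-step rewards are not annotated by hand: given the known correct final answer, ReST-MCTS* infers each step's reward by estimating the probability that the step leads to that answer. The inferred rewards serve two purposes: they are value targets for refining the process reward model, and they select high-quality traces for self-training the policy model.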
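To make the reward-inference idea concrete, here is a minimal sketch in Python. It is not the paper's formula: it assumes a plain Monte Carlo frequency estimate, in which the value of a partial reasoning path is the fraction of sampled traces passing through it that end in the oracle answer. The function name `infer_step_values` and the `(steps, final_answer)` trace format are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def infer_step_values(traces, oracle_answer):
    """Estimate a value for every partial reasoning path (prefix).

    Assumption (not from the paper): the value of a prefix is the share of
    sampled traces through that prefix whose final answer is correct.
    """
    visits = defaultdict(int)
    wins = defaultdict(int)
    for steps, final_answer in traces:
        correct = final_answer == oracle_answer
        for depth in range(1, len(steps) + 1):
            prefix = tuple(steps[:depth])
            visits[prefix] += 1
            wins[prefix] += int(correct)
    return {prefix: wins[prefix] / visits[prefix] for prefix in visits}


# Toy traces for the problem "solve x + 2 = 5"; only the final answer is labeled.
traces = [
    (["x + 2 = 5", "x = 3"], "3"),
    (["x + 2 = 5", "x = 7"], "7"),
    (["x = 5 * 2", "x = 10"], "10"),
]
values = infer_step_values(traces, oracle_answer="3")
for prefix, value in values.items():
    print(prefix, value)
```

In this toy run the shared first step `"x + 2 = 5"` receives a value of 0.5 because one of its two continuations is correct, while the wrong first step `"x = 5 * 2"` receives 0.0. No step was labeled by hand; the scores come entirely from the final-answer check.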

The researchers draw inspiration from prior work on learning planning-based reasoning and value model generation to design the ReST-MCTS* system. They report that the tree-search policy achieves higher accuracy than LLM reasoning baselines such as Best-of-N and Tree-of-Thought within the same search budget, and that training on the searched traces improves the language models over multiple iterations, outperforming self-training algorithms such as ReST$^\text{EM}$ and Self-Rewarding LM.
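One round of the self-training data collection might be sketched as follows. This is an illustrative simplification, not the authors' pipeline: it assumes per-prefix values are already available, keeps a trace for policy training only if its final answer is correct and every prefix value clears a threshold, and emits all (prefix, value) pairs as regression targets for the process reward model. The name `build_training_sets` and the `min_value` threshold are hypothetical.

```python
def build_training_sets(traces, step_values, oracle_answer, min_value=0.5):
    """Split search results into policy data and reward-model targets.

    Assumptions (not from the paper): a trace is kept for the policy only if
    it ends in the oracle answer and none of its prefixes scores below
    `min_value`; every scored prefix becomes a reward-model target.
    """
    policy_data = []
    for steps, final_answer in traces:
        if final_answer != oracle_answer:
            continue  # discard traces with a wrong final answer
        prefix_values = [
            step_values[tuple(steps[:depth])] for depth in range(1, len(steps) + 1)
        ]
        if min(prefix_values) >= min_value:
            policy_data.append(steps)
    reward_data = sorted(step_values.items())
    return policy_data, reward_data


# Hypothetical per-prefix values, e.g. produced by a tree search.
step_values = {
    ("x + 2 = 5",): 0.5,
    ("x + 2 = 5", "x = 3"): 1.0,
    ("x + 2 = 5", "x = 7"): 0.0,
    ("x = 5 * 2",): 0.0,
    ("x = 5 * 2", "x = 10"): 0.0,
}
traces = [
    (["x + 2 = 5", "x = 3"], "3"),
    (["x + 2 = 5", "x = 7"], "7"),
    (["x = 5 * 2", "x = 10"], "10"),
]
policy_data, reward_data = build_training_sets(traces, step_values, oracle_answer="3")
print(policy_data)
print(len(reward_data))
```

Note that the reward-model targets include low-value prefixes as well. Under this sketch, incorrect steps are still useful training signal for the reward model even though they are excluded from the policy's fine-tuning set.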

Critical Analysis

The researchers acknowledge several limitations and areas for future research in their paper. For example, the process reward function is a key component, and further work is needed to understand how to design effective reward functions for different types of reasoning tasks.

Additionally, the paper does not extensively explore the potential biases or safety concerns that may arise from this self-training approach. As LLMs become more capable through this process, it will be important to carefully monitor their outputs and behaviors to ensure they are aligned with ethical and societal values.

Another area for further investigation is the scalability of the ReST-MCTS* approach. The researchers demonstrate the method on relatively small-scale tasks, and it remains to be seen how well it would perform on more complex, real-world reasoning problems.

Overall, the ReST-MCTS* system represents an intriguing step towards self-improvement of LLMs through tree search guided by process rewards, but many open questions and challenges remain in this promising area of research.


Conclusion

The ReST-MCTS* system presented in this paper offers a novel approach to enabling large language models to self-train and improve their reasoning capabilities. By integrating tree search with a process-focused reward, the researchers have demonstrated a way for LLMs to progressively learn and enhance their problem-solving skills using only correct final answers, without manual annotation of individual reasoning steps.

This work represents an important step towards more autonomous and capable AI systems that can learn and reason in a more self-directed manner. As the field of artificial intelligence continues to advance, techniques like ReST-MCTS* may play a key role in helping large language models become more robust, reliable, and beneficial to society.

