UNSAT Solver Synthesis via Monte Carlo Forest Search

Read original: arXiv:2211.12581 - Published 7/16/2024 by Chris Cameron, Jason Hartford, Taylor Lundy, Tuan Truong, Alan Milligan, Rex Chen, Kevin Leyton-Brown

❗

Overview

Introduces a new class of reinforcement learning (RL) algorithms called Monte Carlo Forest Search (MCFS) for learning policies in tree-structured Markov Decision Processes (tree MDPs)
Focuses on problems where policy execution involves traversing an exponential-sized tree, such as proving unsatisfiability of a SAT formula, counting the number of solutions to a satisfiable SAT formula, or finding the optimal solution to a mixed-integer program
Presents an instantiation of MCFS called Knuth Synthesis, which learns DPLL branching policies for solving the Boolean satisfiability (SAT) problem with the goal of achieving good average-case performance on a given distribution of unsatisfiable instances

Plain English Explanation

The paper introduces a new approach called Monte Carlo Forest Search (MCFS) for solving complex problems that involve searching through an exponentially large tree of possible solutions. Examples of such problems include proving that a set of logical constraints (a SAT formula) has no solutions, counting the number of solutions to a satisfiable SAT formula, or finding the best solution to a mixed-integer programming problem.

The key idea behind MCFS is to view these tree-structured problems not as finding a single good path through the tree, but rather as finding a small "tree" (or set of solutions) within a larger "forest" of candidate trees. The researchers develop a specific MCFS algorithm called Knuth Synthesis, which learns how to make early decisions in the search process that can dramatically reduce the size of the overall search tree.

Knuth Synthesis does this by leveraging two key insights: First, it uses a clever statistical technique developed by Donald Knuth to estimate the size of the search tree by randomly sampling paths through it, rather than trying to fully explore the entire tree. Second, it focuses its learning on the early decisions in the search process, which have the greatest potential to reduce the overall tree size, rather than trying to learn a policy for the entire tree.

By using these techniques, Knuth Synthesis was able to match or exceed the performance of a strong baseline on several well-known SAT problem distributions, even on instances that were two orders of magnitude more challenging than those tackled in previous reinforcement learning studies.

Technical Explanation

The paper introduces a new class of reinforcement learning (RL) algorithms called Monte Carlo Forest Search (MCFS) for learning policies in tree-structured Markov Decision Processes (tree MDPs). In these problems, policy execution involves traversing an exponential-sized tree, which makes traditional RL approaches prohibitively expensive.

MCFS algorithms can be seen as an extension of Monte Carlo Tree Search (MCTS) to cases where the goal is not to find a good path (solution) within a tree, but rather to find a small tree within a forest of candidate trees. The researchers instantiate and evaluate their ideas in an algorithm called Knuth Synthesis, an MCFS algorithm that learns DPLL branching policies for solving the Boolean satisfiability (SAT) problem.

Knuth Synthesis addresses the prohibitive costs of policy evaluations in an exponentially-sized tree using two key ideas: First, it estimates tree size by randomly sampling paths and measuring their lengths, drawing on an unbiased approximation technique due to Knuth (1975). Second, it queries a strong solver at a user-defined depth rather than learning a policy across the whole tree, focusing the policy search on early decisions that offer the greatest potential for reducing tree size.

The researchers show that Knuth Synthesis matched or exceeded the performance of a strong baseline on three well-known SAT distributions, tackling problems that were two orders of magnitude more challenging than those addressed in previous RL studies.

Critical Analysis

The paper presents a novel and promising approach to solving complex, tree-structured problems using reinforcement learning. The key insights of using statistical tree size estimation and focusing the policy search on early decisions are clever and well-justified.

However, the paper does not provide a comprehensive analysis of the limitations of the Knuth Synthesis algorithm. For example, it would be useful to understand how the performance of the algorithm scales with the size and complexity of the problem instances, or how sensitive it is to the choice of the user-defined depth at which the strong solver is queried.

Additionally, the paper does not compare Knuth Synthesis to other recent advances in RL for tree-structured problems, such as the work on proof-number based MCTS. Exploring these comparisons could provide valuable insights into the relative strengths and weaknesses of different approaches.

Overall, the paper makes a strong contribution by introducing MCFS and the Knuth Synthesis algorithm, but there is room for further research to fully understand the capabilities and limitations of this new class of RL algorithms.

Conclusion

This paper presents a novel class of reinforcement learning algorithms called Monte Carlo Forest Search (MCFS) for tackling complex, tree-structured problems. The researchers develop a specific MCFS algorithm called Knuth Synthesis, which learns DPLL branching policies for solving the Boolean satisfiability (SAT) problem.

By leveraging statistical tree size estimation and focusing the policy search on early decisions, Knuth Synthesis was able to match or exceed the performance of a strong baseline on several well-known SAT problem distributions, even on instances that were significantly more challenging than those addressed in previous RL studies.

This work opens up new avenues for applying reinforcement learning to a broader class of problems that involve searching through exponentially large trees of possibilities, with potential applications in areas like theorem proving, program synthesis, and optimization. Further research is needed to fully understand the capabilities and limitations of MCFS algorithms, but this paper represents an important step forward in this promising direction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

❗

UNSAT Solver Synthesis via Monte Carlo Forest Search

Chris Cameron, Jason Hartford, Taylor Lundy, Tuan Truong, Alan Milligan, Rex Chen, Kevin Leyton-Brown

We introduce Monte Carlo Forest Search (MCFS), a class of reinforcement learning (RL) algorithms for learning policies in {tree MDPs}, for which policy execution involves traversing an exponential-sized tree. Examples of such problems include proving unsatisfiability of a SAT formula; counting the number of solutions of a satisfiable SAT formula; and finding the optimal solution to a mixed-integer program. MCFS algorithms can be seen as extensions of Monte Carlo Tree Search (MCTS) to cases where, rather than finding a good path (solution) within a tree, the problem is to find a small tree within a forest of candidate trees. We instantiate and evaluate our ideas in an algorithm that we dub Knuth Synthesis, an MCFS algorithm that learns DPLL branching policies for solving the Boolean satisfiability (SAT) problem, with the objective of achieving good average-case performance on a given distribution of unsatisfiable problem instances. Knuth Synthesis is the first RL approach to avoid the prohibitive costs of policy evaluations in an exponentially-sized tree, leveraging two key ideas: first, we estimate tree size by randomly sampling paths and measuring their lengths, drawing on an unbiased approximation due to Knuth (1975); second, we query a strong solver at a user-defined depth rather than learning a policy across the whole tree, to focus our policy search on early decisions that offer the greatest potential for reducing tree size. We matched or exceeded the performance of a strong baseline on three well-known SAT distributions, facing problems that were two orders of magnitude more challenging than those addressed in previous RL studies.

7/16/2024

Structure and Reduction of MCTS for Explainable-AI

Ronit Bustin, Claudia V. Goldman

Complex sequential decision-making planning problems, covering infinite states' space have been shown to be solvable by AlphaZero type of algorithms. Such an approach that trains a neural model while simulating projection of futures with a Monte Carlo Tree Search algorithm were shown to be applicable to real life planning problems. As such, engineers and users interacting with the resulting policy of behavior might benefit from obtaining automated explanations about these planners' decisions offline or online. This paper focuses on the information within the Monte Carlo Tree Search data structure. Given its construction, this information contains much of the reasoning of the sequential decision-making algorithm and is essential for its explainability. We show novel methods using information theoretic tools for the simplification and reduction of the Monte Carlo Tree Search and the extraction of information. Such information can be directly used for the construction of human understandable explanations. We show that basic explainability quantities can be calculated with limited additional computational cost, as an integrated part of the Monte Carlo Tree Search construction process. We focus on the theoretical and algorithmic aspects and provide examples of how the methods presented here can be used in the construction of human understandable explanations.

8/13/2024

🤯

Layered and Staged Monte Carlo Tree Search for SMT Strategy Synthesis

Zhengyang Lu, Stefan Siemer, Piyush Jha, Joel Day, Florin Manea, Vijay Ganesh

Modern SMT solvers, such as Z3, offer user-controllable strategies, enabling users to tailor solving strategies for their unique set of instances, thus dramatically enhancing solver performance for their use case. However, this approach of strategy customization presents a significant challenge: handcrafting an optimized strategy for a class of SMT instances remains a complex and demanding task for both solver developers and users alike. In this paper, we address this problem of automatic SMT strategy synthesis via a novel Monte Carlo Tree Search (MCTS) based method. Our method treats strategy synthesis as a sequential decision-making process, whose search tree corresponds to the strategy space, and employs MCTS to navigate this vast search space. The key innovations that enable our method to identify effective strategies, while keeping costs low, are the ideas of layered and staged MCTS search. These novel heuristics allow for a deeper and more efficient exploration of the strategy space, enabling us to synthesize more effective strategies than the default ones in state-of-the-art (SOTA) SMT solvers. We implement our method, dubbed Z3alpha, as part of the Z3 SMT solver. Through extensive evaluations across six important SMT logics, Z3alpha demonstrates superior performance compared to the SOTA synthesis tool FastSMT, the default Z3 solver, and the CVC5 solver on most benchmarks. Remarkably, on a challenging QF_BV benchmark set, Z3alpha solves 42.7% more instances than the default strategy in the Z3 SMT solver.

5/1/2024

🔎

Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning

Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, Michael Shieh

We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by the successful strategy employed by AlphaZero. Our work leverages Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals. To enhance consistency in intermediate steps, we combine outcome validation and stepwise self-evaluation, continually updating the quality assessment of newly generated data. The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data. Theoretical analysis reveals the importance of using on-policy sampled data for successful self-improving. Extensive evaluations on various arithmetic and commonsense reasoning tasks demonstrate remarkable performance improvements over existing models. For instance, our approach outperforms the Mistral-7B Supervised Fine-Tuning (SFT) baseline on GSM8K, MATH, and ARC-C, with substantial increases in accuracy to $81.8%$ (+$5.9%$), $34.7%$ (+$5.8%$), and $76.4%$ (+$15.8%$), respectively. Additionally, our research delves into the training and inference compute tradeoff, providing insights into how our method effectively maximizes performance gains. Our code is publicly available at https://github.com/YuxiXie/MCTS-DPO.

6/19/2024