Policy-Space Search: Equivalences, Improvements, and Compression

Read original: arXiv:2403.19883 - Published 4/1/2024 by Frederico Messa, Andr'e Grahl Pereira

🔎

Introduction

The paper introduces AND* (A* with Non-Determinism), a sound and complete algorithm that searches for a policy solution in the space of policies for Fully Observable Non-Deterministic (FOND) planning tasks. FOND models uncertainty through actions with non-deterministic effects.

AND* starts with an empty policy and uses a priority queue to explore incomplete candidate policies until finding a solution. When a chosen policy is not a solution, AND* replaces it with successor policies in the queue. With a suitable heuristic, AND* can find a minimal mapping policy solution.

The paper studies policy equivalences to avoid expanding multiple equivalent candidate policies, improving performance. It introduces rules to determine if two policies are equivalent for finding a compact solution. A polynomial-time "concretizer" procedure constructs a solution policy knowing only its states, enabling stronger equivalence rules.

State equivalences based on structural symmetries are also computed to further strengthen policy equivalence detection. Techniques from group theory literature improve symmetry computation.

The impacts of early deadlock pruning and satisficing search strategies like greedy best-first search are also studied. Variants WAND* (Weighted AND*) and GBFSND (Greedy Best-First Search with Non-Determinism) are introduced.

Finally, a "compressor" procedure compresses solution policies into compact representations over partial states while preserving equivalence. This allows AND* solutions to achieve state-of-the-art compactness on benchmarks.

The paper covers background, policy/state equivalence concepts, satisficing techniques, the compressor, empirical analysis against other planners, and conclusions.

Background on FOND Planning and Policy-Space Search

The provided section describes the FOND (Fully Observable Non-Deterministic) planning task formulation and the AND* algorithm for solving FOND tasks. It starts by formally defining the components of a FOND task, including the initial state, goal condition, actions with non-deterministic effects, states, transitions, trajectories, policies, and solution policies. It then presents the AND* algorithm, which is a best-first heuristic search algorithm that explores the policy space to find a solution policy for a given FOND task. The algorithm uses an admissible and goal-aware heuristic function to guide the search.

The section then introduces a specific admissible and goal-aware heuristic called Δ↓, which combines information from a classical planning heuristic (hLM-Cut) and the size of the remaining part of the policy that needs to be defined. It provides an example of computing the Δ↓ heuristic for a simple FOND task.

Finally, the section describes the benchmarks and experimental settings used for evaluating the performance of the AND* algorithm and its variants. It mentions two benchmark sets, IPC-FOND and NEW-FOND, with a total of 16 planning domains and hundreds of tasks. It also explains the experimental setup, including the hardware, time and memory limits, and additional configurations for tie-breaking and handling goal states.

Policy Equivalence Pruning

The provided section discusses different equivalence pruning concepts for the AND* algorithm to solve fully observable nondeterministic (FOND) planning problems. It first introduces the concept of equivalence pruning and the signatures used to define policy equivalence. The section then covers several equivalence pruning approaches:

Lanes equivalence pruning: Two policies are considered equivalent if they have the same domain (set of mapped states) and each mapped state reaches the same set of unmapped states. This approach ensures soundness, completeness, and optimality.
Domain-frontier equivalence pruning: Initially, this approach was unsound as two policies with the same domain and frontier (set of unmapped states) could differ in whether they lead to a solution. The paper introduces a concretizer algorithm that can construct a proper solution policy from a hollow policy (with correct domain and frontier) if one exists. Using the concretizer makes domain-frontier pruning sound, complete, and optimal.
Frontier equivalence pruning: Considering just the frontier for equivalence is unsound as the paper shows examples where pruning based on frontier equivalence can discard all solution-leading policies. However, the algorithm never returns false positives (non-solution policies) when using frontier pruning. The paper suggests running frontier pruning first and if it fails, re-running with domain-frontier pruning as a backup to ensure soundness and completeness (sacrificing optimality).

The section provides empirical results comparing the different pruning approaches, showing that frontier pruning achieves the highest coverage and lowest number of generated policies, followed by domain-frontier and then lanes pruning. The use of the concretizer is key to making domain-frontier pruning sound without requiring the backup domain-frontier search needed for frontier pruning.

Improving Equivalence Pruning with State Symmetries

The passage discusses techniques to strengthen the pruning effect of policy frontier equivalence in AND/OR search for fully observable non-deterministic planning using state symmetries. It defines structural state symmetries as permutations that preserve goal states, applicable actions, and successor states. It presents two approaches to compute these symmetries - an approximated greedy approach and a canonical approach using group theory techniques from prior work.

The empirical evaluation shows that a significant portion of benchmark tasks exhibit structural symmetries. The canonical approach can efficiently compute the symmetries by leveraging stabilizer chains of the automorphism group. While slightly slower than the greedy approach, the canonical symmetries provide better pruning. Using state symmetries improves coverage from 65% to 73% over not using symmetries. Several domains see dramatic reductions (up to 99.95%) in the number of generated policies when exploiting symmetries, especially with the canonical approach. Major coverage gains occur in the faults and miner domains when using canonical symmetries.

Overall, the work demonstrates how leveraging structural state symmetries, computed via group theory techniques, can significantly improve the performance of AND/OR search algorithms for fully observable non-deterministic planning.

Search Improvements

The paper evaluates satisficing search techniques to increase the planner's ability to find solutions for FOND (Fully Observable Non-Deterministic) planning tasks.

5.1 Deadlock Detection:

The authors examine the impact of enabling deadlock detection in the AND* algorithm. Deadlock detection allows early discard of policies that certainly won't lead to a solution.
Enabling deadlock detection has a positive impact, especially for the "tireworld-truck" domain, increasing overall coverage from 0.73 to 0.76.

5.2 Sub-Optimal Search:

The authors evaluate two sub-optimal search variants: 2-WAND* (Weighted A* with a weight of 2) and GBFSND (Greedy Best-First Search for Non-Determinism).
2-WAND* improves coverage from 0.76 to 0.85, with major gains in several domains like blocksworld and first-responders. However, it generates fewer policies, dramatically reducing the search effort.
GBFSND decreases overall coverage from 0.76 to 0.69 but interestingly increases coverage for the "earth-observation" domain to 1.0. It generates much larger policies on average.

The authors decide to use 2-WAND* with deadlock detection for the remainder of their work, as it improves overall coverage while maintaining reasonable policy sizes.

Solution Compressing

This section introduces a procedure called the compressor that allows compressing solution policies represented over states into more compact policies represented over partial states. The key ideas are:

A partial state represents a set of states satisfying certain properties over truth valuations of facts. A partial policy maps partial states to actions.
The compressor uses an integer program (IP) to find a minimal set of partial states that covers all states mapped to each action by the original state-based policy, without covering any other states.
It subdivides the problem into subproblems for each action and uses the IP iteratively with increasing numbers of partial states until a solution is found.
The IP models the constraints that each covered state is represented by some partial state, and no uncovered state is represented. It aims to minimize the total number of partial state to fact mappings.
An empirical evaluation shows the compressor can significantly reduce policy sizes for several benchmark domains, with the doors domain seeing up to 3 orders of magnitude reduction.

The compressor allows converting any state-based policy into a compact partial-state representation without losing information, enabling more concise policy representations. The IP formulation and iterative approach are key to making this compression procedure efficient in practice.

Comparison with the State of the Art

The section compares the performance of the proposed planner AND*^* with other state-of-the-art FOND planners - PRP, FOND-SAT, and IDFSP.

It provides an overview of how each of the compared planners works. PRP uses classical planning to find trajectories and generalizes them into partial policies. FOND-SAT compiles FOND tasks into SAT instances and extracts policies from the solutions. IDFSP performs an iterative depth-first implicit policy search.

The results show that AND*^*, PRP, and IDFSP have similar coverage (around 80-85% tasks solved), while FOND-SAT lags behind at 43% coverage. If a hybrid planner running each of the four in sequence is used, perfect coverage can be achieved.

Regarding solution sizes, AND*^* returns solutions at least 10% smaller than PRP in 6 domains. Compared to FOND-SAT, the solution sizes are very similar, with FOND-SAT slightly smaller in 7 domains and AND*^* smaller in 1 domain.

The section does not provide any text to summarize for the domains where the ratios don't add up to 100% for PRP or the memory limit failures for FOND-SAT.

Conclusion and Future Work

The paper studied policy-space equivalence pruning for a FOND (Fully Observable Non-Deterministic) planner called AND*^* that generalizes the A*^* algorithm. Three different features were presented to derive equivalences between policies, allowing substantial improvement in the policy-space search performed by AND*^*. A key procedure called the concretizer was introduced to find correct mappings of a solution in polynomial time if the states are known.

The concepts of policy equivalences provide different guarantees and empirical improvements to AND*^*'s search. The paper also investigated symmetries over states for stronger policy pruning and used group theory techniques for improved symmetry computation. Giving reasonable extra weight to heuristic values was found beneficial, while greedy best-first policy-space search was detrimental.

A solution compressor was presented to find a policy representing a given solution over complete states using the minimum number of partial states. The compressor allows compression of solutions from any FOND planner and is efficient for the benchmark tasks solved.

Overall, the work advances policy-space explicit heuristic search performance by better understanding FOND policy structure, incorporating symmetry detection techniques, evaluating satisficing search methods, and compressing solutions. AND*^* with these improvements proved competitive against other state-of-the-art FOND planners in coverage and solution compactness.

Future work includes studying better heuristics, using partial states for search compression, incorporating dead-end generalization from PRP, and exploring the concretizer's potential for deducing solutions from domain sets.

Acknowledgments

This section acknowledges funding sources that supported the research presented in the paper. The authors express gratitude to several Brazilian organizations for partially funding the work, including UFRGS, CNPq (the National Council for Scientific and Technological Development), CAPES (the Coordination for the Improvement of Higher Education Personnel), and FAPERGS (the Research Support Foundation of the State of Rio Grande do Sul). Specifically, CNPq provided general support, FAPERGS funded a specific project (21/2551-0000741-9), and CAPES provided financial assistance through its Finance Code 001 program. The authors also thank the authors of a previous study (Winterer et al., 2016) for making their code publicly available.

Appendix A Proofs for AND{}^{}start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT with Equivalence Pruning

This section introduces theoretical guarantees for the AND* algorithm under different settings of equivalence pruning. Key points:

Proving AND* always terminates and never returns a non-solution policy (no false positives) is straightforward.
Proving AND* always returns an (optimal) solution when one exists (no false negatives) is challenging for most settings.
The concept of (optimal) certificate pairs is presented to prove the absence of false negatives under certain conditions.
Certificate pairs ensure the algorithm only stops expanding a policy when a solution is found or there exists a better successor policy.
Optimal certificate pairs additionally ensure the solution found is optimal according to a given heuristic and cost metric.
Theoretical results are provided for three equivalence pruning signatures: identity, lanes, and domain-frontier.
For identity pruning, AND* is proven sound, complete and optimal when using an admissible and goal-aware heuristic.
For lanes pruning, AND* is proven sound, complete and optimal for the domain size cost metric when using an admissible optimal-size heuristic.
For domain-frontier pruning with the concretizer, AND* is proven sound, complete and optimal for the domain size cost when using an admissible optimal-size heuristic with extended awareness.

The concretizer procedure used in domain-frontier pruning is also analyzed theoretically to establish its termination, soundness and completeness guarantees.

Appendix B Concretizer Theoretical Proofs

This section presents several lemmas and a theorem related to the soundness and completeness of the "concretizer" algorithm for finding proper policies in classical planning problems.

Lemma 6 proves that the concretizer algorithm always terminates after a finite number of iterations.

Lemma 7 shows that if the concretizer returns a policy π, then π is proper, has the correct domain D, and the frontier states front(π) are contained in the goal set F.

Lemma 8 proves that the concretizer is sound when there is no proper policy with domain D and frontier in F.

Lemma 9 proves that the concretizer is sound when there exists a proper policy with domain D and frontier in F.

Theorem 4 combines the previous lemmas to state that the concretizer is sound and complete.

The section also includes a glossary defining the various symbols and functions used, such as the state set S, action set A, successor function succs, policies π, their domains dom(π), reachable states reach(π), frontier states front(π), and operations like slice on policies.

Appendix C Glossary of Symbols and Functions

The provided text defines various notations and concepts related to automated planning. Here is a summary:

ℱ is the set of all possible facts or propositions that can be true or false in a state. ℱ[s] represents the set of facts that are true in a particular state s.

𝒮 is the set of all possible states. 𝒮* is the set of goal states we want to reach.

𝒜 is the set of available actions that can transition between states.

succs(s,a) gives the set of successor states that may result from applying action a in state s.

A policy π maps non-goal states to actions. π[s] is the action prescribed for state s by policy π.

Various set notations are defined related to a policy π:

dom(π) is the set of states mapped by π
reach(π) is the set of states reachable following π
front(π) are the frontier states reachable but not mapped by π
remain(π) are the non-goal frontier states

reach(π,s) and escape(π,s) give reachable and frontier states from a specific state s following π.

slice(π,X) produces a new policy π' that agrees with π on states in set X but has no mappings outside X.

A proper policy has all reachable states mapped or leading to goal states. The remain set being empty indicates π is a valid solution policy.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🔎

Policy-Space Search: Equivalences, Improvements, and Compression

Frederico Messa, Andr'e Grahl Pereira

Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty. It models uncertainty through actions with non-deterministic effects. A* with Non-Determinism (AND*) (Messa and Pereira, 2023) is a FOND planner that generalizes A* (Hart et al., 1968) for FOND planning. It searches for a solution policy by performing an explicit heuristic search on the policy space of the FOND task. In this paper, we study and improve the performance of the policy-space search performed by AND*. We present a polynomial-time procedure that constructs a solution policy given just the set of states that should be mapped. This procedure, together with a better understanding of the structure of FOND policies, allows us to present three concepts of equivalences between policies. We use policy equivalences to prune part of the policy search space, making AND* substantially more effective in solving FOND tasks. We also study the impact of taking into account structural state-space symmetries to strengthen the detection of equivalence policies and the impact of performing the search with satisficing techniques. We apply a recent technique from the group theory literature to better compute structural state-space symmetries. Finally, we present a solution compressor that, given a policy defined over complete states, finds a policy that unambiguously represents it using the minimum number of partial states. AND* with the introduced techniques generates, on average, two orders of magnitude fewer policies to solve FOND tasks. These techniques allow explicit policy-space search to be competitive in terms of both coverage and solution compactness with other state-of-the-art FOND planners.

4/1/2024

🌐

Learning Generalized Policies for Fully Observable Non-Deterministic Planning Domains

Till Hofmann, Hector Geffner

General policies represent reactive strategies for solving large families of planning problems like the infinite collection of solvable instances from a given domain. Methods for learning such policies from a collection of small training instances have been developed successfully for classical domains. In this work, we extend the formulations and the resulting combinatorial methods for learning general policies over fully observable, non-deterministic (FOND) domains. We also evaluate the resulting approach experimentally over a number of benchmark domains in FOND planning, present the general policies that result in some of these domains, and prove their correctness. The method for learning general policies for FOND planning can actually be seen as an alternative FOND planning method that searches for solutions, not in the given state space but in an abstract space defined by features that must be learned as well.

5/14/2024

🖼️

The Update-Equivalence Framework for Decision-Time Planning

Samuel Sokota, Gabriele Farina, David J. Wu, Hengyuan Hu, Kevin A. Wang, J. Zico Kolter, Noam Brown

The process of revising (or constructing) a policy at execution time -- known as decision-time planning -- has been key to achieving superhuman performance in perfect-information games like chess and Go. A recent line of work has extended decision-time planning to imperfect-information games, leading to superhuman performance in poker. However, these methods involve solving subgames whose sizes grow quickly in the amount of non-public information, making them unhelpful when the amount of non-public information is large. Motivated by this issue, we introduce an alternative framework for decision-time planning that is not based on solving subgames, but rather on update equivalence. In this update-equivalence framework, decision-time planning algorithms replicate the updates of last-iterate algorithms, which need not rely on public information. This facilitates scalability to games with large amounts of non-public information. Using this framework, we derive a provably sound search algorithm for fully cooperative games based on mirror descent and a search algorithm for adversarial games based on magnetic mirror descent. We validate the performance of these algorithms in cooperative and adversarial domains, notably in Hanabi, the standard benchmark for search in fully cooperative imperfect-information games. Here, our mirror descent approach exceeds or matches the performance of public information-based search while using two orders of magnitude less search time. This is the first instance of a non-public-information-based algorithm outperforming public-information-based approaches in a domain they have historically dominated.

5/14/2024

🛠️

Combinatorial Optimization with Policy Adaptation using Latent Space Search

Felix Chalumeau, Shikha Surana, Clement Bonnet, Nathan Grinsztajn, Arnu Pretorius, Alexandre Laterre, Thomas D. Barrett

Combinatorial Optimization underpins many real-world applications and yet, designing performant algorithms to solve these complex, typically NP-hard, problems remains a significant research challenge. Reinforcement Learning (RL) provides a versatile framework for designing heuristics across a broad spectrum of problem domains. However, despite notable progress, RL has not yet supplanted industrial solvers as the go-to solution. Current approaches emphasize pre-training heuristics that construct solutions but often rely on search procedures with limited variance, such as stochastically sampling numerous solutions from a single policy or employing computationally expensive fine-tuning of the policy on individual problem instances. Building on the intuition that performant search at inference time should be anticipated during pre-training, we propose COMPASS, a novel RL approach that parameterizes a distribution of diverse and specialized policies conditioned on a continuous latent space. We evaluate COMPASS across three canonical problems - Travelling Salesman, Capacitated Vehicle Routing, and Job-Shop Scheduling - and demonstrate that our search strategy (i) outperforms state-of-the-art approaches on 11 standard benchmarking tasks and (ii) generalizes better, surpassing all other approaches on a set of 18 procedurally transformed instance distributions.

5/29/2024

Policy-Space Search: Equivalences, Improvements, and Compression

Introduction

Background on FOND Planning and Policy-Space Search

Policy Equivalence Pruning

Improving Equivalence Pruning with State Symmetries

Search Improvements

Solution Compressing

Comparison with the State of the Art

Conclusion and Future Work

Acknowledgments

Appendix A Proofs for AND*{}^{*}start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT with Equivalence Pruning

Appendix B Concretizer Theoretical Proofs

Appendix C Glossary of Symbols and Functions

Related Papers

Policy-Space Search: Equivalences, Improvements, and Compression

Learning Generalized Policies for Fully Observable Non-Deterministic Planning Domains

The Update-Equivalence Framework for Decision-Time Planning

Combinatorial Optimization with Policy Adaptation using Latent Space Search

Appendix A Proofs for AND{}^{}start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT with Equivalence Pruning