Scale-Adaptive Balancing of Exploration and Exploitation in Classical Planning

Read original: arXiv:2305.09840 - Published 9/2/2024 by Stephen Wissow, Masataro Asai

🔮

Overview

Exploration and exploitation is an important problem in game tree search and automated planning.
The planning community has had limited success in applying insights from the Multi-Armed Bandit (MAB) literature to address this problem.
The paper proposes a new algorithm called GreedyUCT-Normal, which improves upon existing Monte Carlo Tree Search (MCTS) and Trial-Based Heuristic Tree Search (THTS) algorithms by better handling different scales of rewards.

Plain English Explanation

When machines are trying to solve complex planning problems, they often face a tradeoff between exploration - trying new approaches to discover better solutions, and exploitation - focusing on what has worked well in the past. Researchers have studied this problem extensively in the context of Multi-Armed Bandits (MABs), which are simplified models of this exploration-exploitation dilemma.

However, the planning community has struggled to apply the insights from MAB research to real-world planning algorithms. The paper explains that a key issue is that popular MAB algorithms like UCB1 make assumptions about the reward distributions that don't match what happens in heuristic-based planning algorithms.

To address this, the paper proposes a new algorithm called GreedyUCT-Normal. This algorithm uses a modified version of the UCB1 bandit algorithm that explicitly takes into account the different scales of the rewards encountered during planning. The result is an improved planning system that can find more plans using fewer computational resources compared to existing approaches.

Technical Explanation

The paper focuses on improving Monte Carlo Tree Search (MCTS) and Trial-Based Heuristic Tree Search (THTS) algorithms for classical planning problems. These algorithms use MAB techniques like UCB1 to balance exploration and exploitation during the search process.

However, the authors note that the theoretical requirements of UCB1 - namely that rewards have a fixed, bounded support - are not satisfied in the heuristic-based reward distributions encountered in classical planning. This leads to suboptimal performance.

To address this, the paper introduces GreedyUCT-Normal, a new MCTS/THTS algorithm that uses a modified UCB1 bandit algorithm called UCB1-Normal. This version of UCB1 explicitly models the different scales of the rewards by taking the reward variance into account.

The authors evaluate GreedyUCT-Normal against existing MCTS/THTS algorithms as well as a Greedy Best First Search baseline. Their results show that GreedyUCT-Normal is able to find more plans using fewer node expansions, demonstrating improved algorithmic performance.

Critical Analysis

The paper provides a useful theoretical analysis of the limitations of applying standard MAB algorithms like UCB1 to heuristic-based planning problems. The proposed GreedyUCT-Normal algorithm appears to be an effective solution to this problem, though the evaluation is limited to classical planning domains.

One potential area for further research would be to examine how well GreedyUCT-Normal generalizes to other types of planning problems, such as probabilistic planning or multi-agent planning. Additionally, a more thorough complexity analysis of the algorithm could help quantify its computational overhead compared to simpler approaches.

Overall, the paper makes a valuable contribution by highlighting an important gap between MAB theory and planning practice, and proposing a novel algorithmic solution to bridge that gap.

Conclusion

This paper demonstrates how a deeper theoretical understanding of Multi-Armed Bandits can lead to improved planning algorithms. By addressing the mismatch between standard MAB assumptions and the realities of heuristic-based planning, the authors were able to develop GreedyUCT-Normal - an MCTS/THTS algorithm with better exploration-exploitation performance.

The implications of this work extend beyond classical planning, as the principles of adaptively handling reward distributions could be applicable to a wide range of sequential decision-making problems. As researchers continue to bridge the gap between bandit theory and real-world applications, we can expect to see further advancements in the field of automated planning and decision-making.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🔮

Scale-Adaptive Balancing of Exploration and Exploitation in Classical Planning

Stephen Wissow, Masataro Asai

Balancing exploration and exploitation has been an important problem in both game tree search and automated planning. However, while the problem has been extensively analyzed within the Multi-Armed Bandit (MAB) literature, the planning community has had limited success when attempting to apply those results. We show that a more detailed theoretical understanding of MAB literature helps improve existing planning algorithms that are based on Monte Carlo Tree Search (MCTS) / Trial Based Heuristic Tree Search (THTS). In particular, THTS uses UCB1 MAB algorithms in an ad hoc manner, as UCB1's theoretical requirement of fixed bounded support reward distributions is not satisfied within heuristic search for classical planning. The core issue lies in UCB1's lack of adaptations to the different scales of the rewards. We propose GreedyUCT-Normal, a MCTS/THTS algorithm with UCB1-Normal bandit for agile classical planning, which handles distributions with different scales by taking the reward variance into consideration, and resulted in an improved algorithmic performance (more plans found with less node expansions) that outperforms Greedy Best First Search and existing MCTS/THTS-based algorithms (GreedyUCT,GreedyUCT*).

9/2/2024

Extreme Value Monte Carlo Tree Search

Masataro Asai, Stephen Wissow

Despite being successful in board games and reinforcement learning (RL), UCT, a Monte-Carlo Tree Search (MCTS) combined with UCB1 Multi-Armed Bandit (MAB), has had limited success in domain-independent planning until recently. Previous work showed that UCB1, designed for $[0,1]$-bounded rewards, is not appropriate for estimating the distance-to-go which are potentially unbounded in $mathbb{R}$, such as heuristic functions used in classical planning, then proposed combining MCTS with MABs designed for Gaussian reward distributions and successfully improved the performance. In this paper, we further sharpen our understanding of ideal bandits for planning tasks. Existing work has two issues: First, while Gaussian MABs no longer over-specify the distances as $hin [0,1]$, they under-specify them as $hin [-infty,infty]$ while they are non-negative and can be further bounded in some cases. Second, there is no theoretical justifications for Full-Bellman backup (Schulte & Keller, 2014) that backpropagates minimum/maximum of samples. We identified emph{extreme value} statistics as a theoretical framework that resolves both issues at once and propose two bandits, UCB1-Uniform/Power, and apply them to MCTS for classical planning. We formally prove their regret bounds and empirically demonstrate their performance in classical planning.

5/29/2024

Hierarchical Multi-Armed Bandits for the Concurrent Intelligent Tutoring of Concepts and Problems of Varying Difficulty Levels

Blake Castleman, Uzay Macar, Ansaf Salleb-Aouissi

Remote education has proliferated in the twenty-first century, yielding rise to intelligent tutoring systems. In particular, research has found multi-armed bandit (MAB) intelligent tutors to have notable abilities in traversing the exploration-exploitation trade-off landscape for student problem recommendations. Prior literature, however, contains a significant lack of open-sourced MAB intelligent tutors, which impedes potential applications of these educational MAB recommendation systems. In this paper, we combine recent literature on MAB intelligent tutoring techniques into an open-sourced and simply deployable hierarchical MAB algorithm, capable of progressing students concurrently through concepts and problems, determining ideal recommended problem difficulties, and assessing latent memory decay. We evaluate our algorithm using simulated groups of 500 students, utilizing Bayesian Knowledge Tracing to estimate students' content mastery. Results suggest that our algorithm, when turned difficulty-agnostic, significantly boosts student success, and that the further addition of problem-difficulty adaptation notably improves this metric.

8/15/2024

A Contextual Combinatorial Bandit Approach to Negotiation

Yexin Li, Zhancun Mu, Siyuan Qi

Learning effective negotiation strategies poses two key challenges: the exploration-exploitation dilemma and dealing with large action spaces. However, there is an absence of learning-based approaches that effectively address these challenges in negotiation. This paper introduces a comprehensive formulation to tackle various negotiation problems. Our approach leverages contextual combinatorial multi-armed bandits, with the bandits resolving the exploration-exploitation dilemma, and the combinatorial nature handles large action spaces. Building upon this formulation, we introduce NegUCB, a novel method that also handles common issues such as partial observations and complex reward functions in negotiation. NegUCB is contextual and tailored for full-bandit feedback without constraints on the reward functions. Under mild assumptions, it ensures a sub-linear regret upper bound. Experiments conducted on three negotiation tasks demonstrate the superiority of our approach.

7/2/2024