HSVI-based Online Minimax Strategies for Partially Observable Stochastic Games with Neural Perception Mechanisms

2404.10679

Published 4/17/2024 by Rui Yan, Gabriel Santos, Gethin Norman, David Parker, Marta Kwiatkowska

HSVI-based Online Minimax Strategies for Partially Observable Stochastic Games with Neural Perception Mechanisms

Abstract

We consider a variant of continuous-state partially-observable stochastic games with neural perception mechanisms and an asymmetric information structure. One agent has partial information, with the observation function implemented as a neural network, while the other agent is assumed to have full knowledge of the state. We present, for the first time, an efficient online method to compute an $varepsilon$-minimax strategy profile, which requires only one linear program to be solved for each agent at every stage, instead of a complex estimation of opponent counterfactual values. For the partially-informed agent, we propose a continual resolving approach which uses lower bounds, pre-computed offline with heuristic search value iteration (HSVI), instead of opponent counterfactual values. This inherits the soundness of continual resolving at the cost of pre-computing the bound. For the fully-informed agent, we propose an inferred-belief strategy, where the agent maintains an inferred belief about the belief of the partially-informed agent based on (offline) upper bounds from HSVI, guaranteeing $varepsilon$-distance to the value of the game at the initial belief known to both agents.

Create account to get full access

Overview

This paper presents a novel approach called NS-HVSI (Neural Perception-based Simultaneous Recurrent Solving) for solving partially observable stochastic games with neural perception mechanisms.
The key idea is to combine Hindsight Succinct Value Iteration (HSVI) with neural networks for perception, enabling efficient online decision-making in complex environments.
The method is evaluated on challenging partially observable matrix games, demonstrating improved performance compared to existing techniques.

Plain English Explanation

This research focuses on a type of game called a "partially observable stochastic game." In these games, players don't have full information about the current state of the game - there's some uncertainty involved. The researchers developed a new approach called NS-HSVI that allows players to make good decisions in these kinds of games.

The core idea is to combine two existing techniques: Hindsight Succinct Value Iteration (HSVI) and neural networks. HSVI is a method for finding optimal strategies in games with incomplete information. Neural networks are a type of machine learning model that can be trained to recognize patterns in data.

By bringing these two things together, the researchers created a system that can efficiently make decisions in complex, uncertain environments. The neural networks handle the perception part - figuring out the current state of the game from limited information. And HSVI uses that input to determine the best move to make.

The researchers tested their approach on some challenging example games, and showed that it outperformed other existing techniques. This suggests the NS-HSVI method could be a useful tool for building intelligent agents that can make good decisions in the face of incomplete or uncertain information.

Technical Explanation

The paper introduces a novel algorithm called NS-HSVI (Neural Perception-based Simultaneous Recurrent Solving) for solving partially observable stochastic games (POSGs) with neural perception mechanisms.

At a high level, NS-HSVI combines Hindsight Succinct Value Iteration (HSVI), an efficient online algorithm for solving POSGs, with neural networks for perceiving the current state of the game from partial observations.

The key technical contributions are:

Neural Perception Module: The authors develop a neural network-based perception module that learns to estimate the current state of the game from partial observations. This allows the system to handle complex, high-dimensional observation spaces.
Simultaneous Recurrent Solving: NS-HSVI performs simultaneous planning and perception, updating the value function and neural network parameters in an interleaved fashion during online decision-making.
Theoretical Analysis: The authors provide a convergence guarantee for the NS-HSVI algorithm, showing that it will converge to an ε-optimal minimax strategy in POSGs.

The method is evaluated on challenging partially observable matrix games, where it demonstrates improved performance compared to existing POSG solvers like AWESOME and NIPS.

Critical Analysis

The paper presents a compelling approach for solving partially observable stochastic games using neural perception and online planning. The key strengths are the tight integration of learning and planning, as well as the theoretical convergence guarantee.

However, a potential limitation is the reliance on a specific parametric form for the value function (based on HSVI). This may limit the expressiveness of the learned strategies, especially in complex environments. An interesting direction for future work could be to explore more flexible value function representations, possibly building on techniques like deep reinforcement learning.

Additionally, the evaluation is focused on relatively simple matrix games. It would be valuable to see how the NS-HSVI method scales to larger, more realistic POSG domains, such as multi-agent robotics or security scenarios. Investigating the sample efficiency and robustness of the approach in such settings would help assess its practical applicability.

Overall, the NS-HSVI algorithm represents an interesting step towards building intelligent agents that can make effective decisions in the face of partial information. Further research exploring more expressive value functions and evaluating the method on larger-scale problems could yield valuable insights for the field of multi-agent reinforcement learning.

Conclusion

This paper presents NS-HSVI, a novel algorithm for solving partially observable stochastic games that combines neural perception and online planning. By integrating HSVI, a principled POSG solver, with a neural network-based state estimation module, the method can efficiently make decisions in complex, uncertain environments.

The key strengths of the approach are the tight coupling of learning and planning, as well as the theoretical guarantees on convergence. While the current evaluation is limited to small-scale matrix games, the underlying ideas could have broader applicability to a range of multi-agent decision-making problems. Further research exploring more expressive value function representations and evaluating the method on larger-scale domains could yield valuable insights for the field of artificial intelligence and multiagent systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

📊

Value Approximation for Two-Player General-Sum Differential Games with State Constraints

Lei Zhang, Mukesh Ghimire, Wenlong Zhang, Zhe Xu, Yi Ren

Solving Hamilton-Jacobi-Isaacs (HJI) PDEs numerically enables equilibrial feedback control in two-player differential games, yet faces the curse of dimensionality (CoD). While physics-informed neural networks (PINNs) have shown promise in alleviating CoD in solving PDEs, vanilla PINNs fall short in learning discontinuous solutions due to their sampling nature, leading to poor safety performance of the resulting policies when values are discontinuous due to state or temporal logic constraints. In this study, we explore three potential solutions to this challenge: (1) a hybrid learning method that is guided by both supervisory equilibria and the HJI PDE, (2) a value-hardening method where a sequence of HJIs are solved with increasing Lipschitz constant on the constraint violation penalty, and (3) the epigraphical technique that lifts the value to a higher dimensional state space where it becomes continuous. Evaluations through 5D and 9D vehicle and 13D drone simulations reveal that the hybrid method outperforms others in terms of generalization and safety performance by taking advantage of both the supervisory equilibrium values and costates, and the low cost of PINN loss gradients.

5/8/2024

cs.RO cs.GT cs.LG

🗣️

State-Constrained Zero-Sum Differential Games with One-Sided Information

Mukesh Ghimire, Lei Zhang, Zhe Xu, Yi Ren

We study zero-sum differential games with state constraints and one-sided information, where the informed player (Player 1) has a categorical payoff type unknown to the uninformed player (Player 2). The goal of Player 1 is to minimize his payoff without violating the constraints, while that of Player 2 is to violate the state constraints if possible, or to maximize the payoff otherwise. One example of the game is a man-to-man matchup in football. Without state constraints, Cardaliaguet (2007) showed that the value of such a game exists and is convex to the common belief of players. Our theoretical contribution is an extension of this result to games with state constraints and the derivation of the primal and dual subdynamic principles necessary for computing behavioral strategies. Different from existing works that are concerned about the scalability of no-regret learning in games with discrete dynamics, our study reveals the underlying structure of strategies for belief manipulation resulting from information asymmetry and state constraints. This structure will be necessary for scalable learning on games with continuous actions and long time windows. We use a simplified football game to demonstrate the utility of this work, where we reveal player positions and belief states in which the attacker should (or should not) play specific random deceptive moves to take advantage of information asymmetry, and compute how the defender should respond.

6/5/2024

cs.GT cs.LG

🤿

Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games

Songtao Feng, Ming Yin, Yu-Xiang Wang, Jing Yang, Yingbin Liang

The problem of two-player zero-sum Markov games has recently attracted increasing interests in theoretical studies of multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), it has been shown that model-based algorithms can find an $epsilon$-optimal Nash Equilibrium (NE) with the sample complexity of $O(H^3SAB/epsilon^2)$, which is optimal in the dependence of the horizon $H$ and the number of states $S$ (where $A$ and $B$ denote the number of actions of the two players, respectively). However, none of the existing model-free algorithms can achieve such an optimality. In this work, we propose a model-free stage-based Q-learning algorithm and show that it achieves the same sample complexity as the best model-based algorithm, and hence for the first time demonstrate that model-free algorithms can enjoy the same optimality in the $H$ dependence as model-based algorithms. The main improvement of the dependency on $H$ arises by leveraging the popular variance reduction technique based on the reference-advantage decomposition previously used only for single-agent RL. However, such a technique relies on a critical monotonicity property of the value function, which does not hold in Markov games due to the update of the policy via the coarse correlated equilibrium (CCE) oracle. Thus, to extend such a technique to Markov games, our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions whose value difference is the smallest in the history in order to achieve the desired improvement in the sample efficiency.

6/7/2024

cs.LG cs.AI cs.GT stat.ML

🎯

ApproxED: Approximate exploitability descent via learned best responses

Carlos Martin, Tuomas Sandholm

There has been substantial progress on finding game-theoretic equilibria. Most of that work has focused on games with finite, discrete action spaces. However, many games involving space, time, money, and other fine-grained quantities have continuous action spaces (or are best modeled as having such). We study the problem of finding an approximate Nash equilibrium of games with continuous action sets. The standard measure of closeness to Nash equilibrium is exploitability, which measures how much players can benefit from unilaterally changing their strategy. We propose two new methods that minimize an approximation of exploitability with respect to the strategy profile. The first method uses a learned best-response function, which takes the current strategy profile as input and outputs candidate best responses for each player. The strategy profile and best-response functions are trained simultaneously, with the former trying to minimize exploitability while the latter tries to maximize it. The second method maintains an ensemble of candidate best responses for each player. In each iteration, the best-performing elements of each ensemble are used to update the current strategy profile. The strategy profile and ensembles are simultaneously trained to minimize and maximize the approximate exploitability, respectively. We evaluate our methods on various continuous games and GAN training, showing that they outperform prior methods.

6/14/2024

cs.GT cs.AI cs.LG cs.MA