PCL-Indexability and Whittle Index for Restless Bandits with General Observation Models

Read original: arXiv:2307.03034 - Published 7/4/2024 by Keqin Liu, Chengzhong Zhang

PCL-Indexability and Whittle Index for Restless Bandits with General Observation Models

Overview

This paper explores the PCL-indexability and Whittle index for restless bandits with general observation models.
The authors develop a low-complexity algorithm to compute the Whittle index for restless bandits with imperfect observations.
The research builds upon previous work on low-complexity algorithms for restless bandits with imperfect observations and tabular deep learning approaches to computing the Whittle index.

Plain English Explanation

The paper tackles a challenging problem in the field of multi-armed bandits, which is a framework used to model decision-making scenarios where there are multiple options (like slot machines), each with uncertain rewards. The specific type of multi-armed bandit problem addressed is the "restless" bandit problem, where the state of each option can change even when it is not being actively played.

The key contribution of this work is developing a way to efficiently compute the "Whittle index" for restless bandits with general observation models. The Whittle index is a metric that allows you to prioritize which option to play at any given time, in order to maximize the overall rewards. The authors show that their approach is computationally efficient, meaning it can be used in real-world applications where quick decision-making is required.

This work builds on previous research on low-complexity algorithms for restless bandits with imperfect observations and tabular deep learning approaches to computing the Whittle index. By incorporating general observation models, the authors expand the applicability of the Whittle index to a wider range of real-world scenarios, such as those with noisy or incomplete information about the state of each option.

Technical Explanation

The paper presents a framework for analyzing the PCL-indexability and Whittle index of restless bandits with general observation models. PCL-indexability is a property that ensures the Whittle index can be efficiently computed, and the authors establish sufficient conditions for a restless bandit problem to be PCL-indexable.

To compute the Whittle index, the authors develop a low-complexity algorithm that avoids the need to solve a complex optimization problem. This algorithm leverages the structure of the general observation model to derive a set of simple equations that can be solved efficiently. The authors prove the correctness and optimality of their algorithm.

The paper also includes an analysis of the global rewards restless multi-armed bandit problem, where the rewards are generated by a linear function of the states of all the bandits. This extends previous work on restless bandit problems with rewards generated by linear functions of the states.

Critical Analysis

The paper makes a significant contribution to the literature on restless bandits by developing a computationally efficient algorithm for computing the Whittle index under general observation models. This expands the applicability of the Whittle index to a wider range of real-world scenarios, where the state of each option may not be fully observable.

However, the paper does not address the potential limitations of the Whittle index approach, such as the fact that it may not always lead to optimal decisions, especially in complex environments with many options and interdependent states. Additionally, the authors do not discuss how their approach might scale to very large-scale problems with hundreds or thousands of options.

Further research could explore the robustness of the Whittle index approach to model misspecification, as well as investigate alternative decision-making strategies that may outperform the Whittle index in certain scenarios. Comparisons to other state-of-the-art methods, such as deep reinforcement learning approaches to restless linear bandits, could also provide valuable insights.

Conclusion

This paper presents a significant advancement in the field of restless bandits by developing a low-complexity algorithm for computing the Whittle index under general observation models. The authors establish theoretical guarantees on the performance of their algorithm and demonstrate its applicability to a range of real-world scenarios.

The work builds upon and extends previous research on restless bandits, and the authors have made their code publicly available, which should facilitate further advancements in this important area of study. While the Whittle index approach has limitations, this paper represents an important step forward in making efficient decision-making under uncertainty more accessible and practical.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

PCL-Indexability and Whittle Index for Restless Bandits with General Observation Models

Keqin Liu, Chengzhong Zhang

In this paper, we consider a general observation model for restless multi-armed bandit problems. The operation of the player needs to be based on certain feedback mechanism that is error-prone due to resource constraints or environmental or intrinsic noises. By establishing a general probabilistic model for dynamics of feedback/observation, we formulate the problem as a restless bandit with a countable belief state space starting from an arbitrary initial belief (a priori information). We apply the achievable region method with partial conservation law (PCL) to the infinite-state problem and analyze its indexability and priority index (Whittle index). Finally, we propose an approximation process to transform the problem into which the AG algorithm of Ni~no-Mora and Bertsimas for finite-state problems can be applied to. Numerical experiments show that our algorithm has an excellent performance.

7/4/2024

🔍

Low-Complexity Algorithm for Restless Bandits with Imperfect Observations

Keqin Liu, Richard Weber, Chengzhong Zhang

We consider a class of restless bandit problems that finds a broad application area in reinforcement learning and stochastic optimization. We consider $N$ independent discrete-time Markov processes, each of which had two possible states: 1 and 0 (`good' and `bad'). Only if a process is both in state 1 and observed to be so does reward accrue. The aim is to maximize the expected discounted sum of returns over the infinite horizon subject to a constraint that only $M$ $(<N)$ processes may be observed at each step. Observation is error-prone: there are known probabilities that state 1 (0) will be observed as 0 (1). From this one knows, at any time $t$, a probability that process $i$ is in state 1. The resulting system may be modeled as a restless multi-armed bandit problem with an information state space of uncountable cardinality. Restless bandit problems with even finite state spaces are PSPACE-HARD in general. We propose a novel approach for simplifying the dynamic programming equations of this class of restless bandits and develop a low-complexity algorithm that achieves a strong performance and is readily extensible to the general restless bandit model with observation errors. Under certain conditions, we establish the existence (indexability) of Whittle index and its equivalence to our algorithm. When those conditions do not hold, we show by numerical experiments the near-optimal performance of our algorithm in the general parametric space. Furthermore, we theoretically prove the optimality of our algorithm for homogeneous systems.

5/14/2024

Whittle Index Learning Algorithms for Restless Bandits with Constant Stepsizes

Vishesh Mittal, Rahul Meshram, Surya Prakash

We study the Whittle index learning algorithm for restless multi-armed bandits. We consider index learning algorithm with Q-learning. We first present Q-learning algorithm with exploration policies -- epsilon-greedy, softmax, epsilon-softmax with constant stepsizes. We extend the study of Q-learning to index learning for single-armed restless bandit. The algorithm of index learning is two-timescale variant of stochastic approximation, on slower timescale we update index learning scheme and on faster timescale we update Q-learning assuming fixed index value. In Q-learning updates are in asynchronous manner. We study constant stepsizes two timescale stochastic approximation algorithm. We provide analysis of two-timescale stochastic approximation for index learning with constant stepsizes. Further, we present study on index learning with deep Q-network (DQN) learning and linear function approximation with state-aggregation method. We describe the performance of our algorithms using numerical examples. We have shown that index learning with Q learning, DQN and function approximations learns the Whittle index.

9/10/2024

🤿

Tabular and Deep Learning for the Whittle Index

Francisco Robledo Rela~no (LMAP, UPPA, UPV / EHU), Vivek Borkar (EE-IIT), Urtzi Ayesta (IRIT-RMESS, UPV/EHU, CNRS), Konstantin Avrachenkov (Inria)

The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as Restless Multi-Armed Bandit Problems (RMABPs). In this paper we present QWI and QWINN, two reinforcement learning algorithms, respectively tabular and deep, to learn the Whittle index for the total discounted criterion. The key feature is the use of two time-scales, a faster one to update the state-action Q -values, and a relatively slower one to update the Whittle indices. In our main theoretical result we show that QWI, which is a tabular implementation, converges to the real Whittle indices. We then present QWINN, an adaptation of QWI algorithm using neural networks to compute the Q -values on the faster time-scale, which is able to extrapolate information from one state to another and scales naturally to large state-space environments. For QWINN, we show that all local minima of the Bellman error are locally stable equilibria, which is the first result of its kind for DQN-based schemes. Numerical computations show that QWI and QWINN converge faster than the standard Q -learning algorithm, neural-network based approximate Q-learning and other state of the art algorithms.

6/5/2024