Restless Bandit Problem with Rewards Generated by a Linear Gaussian Dynamical System

2405.09584

Published 5/24/2024 by Jonathan Gornet, Bruno Sinopoli

Restless Bandit Problem with Rewards Generated by a Linear Gaussian Dynamical System

Abstract

Decision-making under uncertainty is a fundamental problem encountered frequently and can be formulated as a stochastic multi-armed bandit problem. In the problem, the learner interacts with an environment by choosing an action at each round, where a round is an instance of an interaction. In response, the environment reveals a reward, which is sampled from a stochastic process, to the learner. The goal of the learner is to maximize cumulative reward. In this work, we assume that the rewards are the inner product of an action vector and a state vector generated by a linear Gaussian dynamical system. To predict the reward for each action, we propose a method that takes a linear combination of previously observed rewards for predicting each action's next reward. We show that, regardless of the sequence of previous actions chosen, the reward sampled for any previously chosen action can be used for predicting another action's future reward, i.e. the reward sampled for action 1 at round $t-1$ can be used for predicting the reward for action $2$ at round $t$. This is accomplished by designing a modified Kalman filter with a matrix representation that can be learned for reward prediction. Numerical evaluations are carried out on a set of linear Gaussian dynamical systems and are compared with 2 other well-known stochastic multi-armed bandit algorithms.

Create account to get full access

Overview

This paper investigates the "restless bandit problem" where reward distributions for multiple "arms" (options) are governed by a linear Gaussian dynamical system.
The authors propose a low-complexity algorithm that can efficiently solve this problem with provable performance guarantees.
The paper builds on previous research in low-complexity algorithms for restless bandits with imperfect observations, provably efficient reinforcement learning for adversarial restless multi-armed bandits, and nonstationary reinforcement learning with linear function approximation.

Plain English Explanation

The "restless bandit problem" is a type of decision-making challenge where an agent (like a software system) must choose which "arm" (option) to pull from a set of options, each of which has a reward distribution that changes over time. Imagine a row of slot machines, where the payout from each machine fluctuates unpredictably. The agent's goal is to maximize the total rewards earned by strategically choosing which machines to play.

In this paper, the authors assume the reward distributions are governed by a linear Gaussian dynamical system, meaning the rewards can be modeled using a set of linear equations with some random noise. They propose a new algorithm that can efficiently solve this problem while providing mathematical guarantees about its performance.

This research builds on previous work in related areas, including algorithms for restless bandits with imperfect observations (where the agent can't perfectly measure the current state), provably efficient reinforcement learning for adversarial restless multi-armed bandits (where the environment actively tries to foil the agent), and nonstationary reinforcement learning with linear function approximation (where the reward distributions change over time in a more complex way).

Technical Explanation

The authors formulate the "restless bandit problem with rewards generated by a linear Gaussian dynamical system" as a Markov decision process, where the state of each "arm" (option) evolves according to a linear Gaussian model. They propose a low-complexity algorithm called the "Optimistic Restless Bandit (ORB)" algorithm that can efficiently solve this problem.

The ORB algorithm maintains an estimate of the current state of each arm and uses an "optimistic" approach to decide which arm to pull. Specifically, it chooses the arm that has the highest potential for future rewards, based on the current state estimates and the linear Gaussian dynamics. The authors prove that ORB achieves near-optimal performance in terms of cumulative rewards, with regret (the difference between the optimal and achieved rewards) that scales logarithmically with the number of time steps.

The paper builds on previous research in related areas, including low-complexity algorithms for restless bandits with imperfect observations, provably efficient reinforcement learning for adversarial restless multi-armed bandits, nonstationary reinforcement learning with linear function approximation, sequential Monte Carlo bandits, and tabular deep reinforcement learning with Gittins index.

Critical Analysis

The paper provides a compelling solution to the restless bandit problem with linear Gaussian dynamics, but there are a few potential limitations and areas for further research:

The authors assume a specific form of the linear Gaussian dynamics, which may not always hold in real-world applications. Relaxing this assumption and developing more general algorithms could broaden the applicability of this work.
The performance guarantees are asymptotic in nature, meaning they hold as the number of time steps approaches infinity. Analyzing the finite-time behavior of the ORB algorithm could provide more practical insights.
The paper does not consider the case where the agent has constraints on the number of arms it can pull per time step, or where the arms have different costs associated with pulling them. Extending the algorithm to handle such constraints could make it more applicable to real-world decision-making scenarios.
The authors do not discuss the computational complexity of the ORB algorithm in detail. Understanding the scalability of the approach as the number of arms or the state space size increases would be valuable.

Overall, this paper presents a significant contribution to the field of restless bandit problems, but further research is needed to address these limitations and explore the broader implications of the proposed approach.

Conclusion

This paper introduces a low-complexity algorithm, called the Optimistic Restless Bandit (ORB) algorithm, for solving the restless bandit problem with rewards generated by a linear Gaussian dynamical system. The authors prove that ORB achieves near-optimal performance in terms of cumulative rewards, with regret that scales logarithmically with the number of time steps.

The research builds on and advances previous work in related areas, including algorithms for restless bandits with imperfect observations, provably efficient reinforcement learning for adversarial restless multi-armed bandits, and nonstationary reinforcement learning with linear function approximation. While the paper presents a compelling solution, there are opportunities for further research to relax the assumptions, analyze finite-time behavior, and explore extensions to handle additional constraints.

The restless bandit problem with linear Gaussian dynamics is a fundamental challenge in sequential decision-making, with applications in areas like adaptive resource allocation, online recommendation systems, and control of dynamical systems. The ORB algorithm proposed in this paper represents a significant step forward in our ability to solve these types of problems efficiently and with strong theoretical guarantees.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🧠

Restless Linear Bandits

Azadeh Khaleghi

A more general formulation of the linear bandit problem is considered to allow for dependencies over time. Specifically, it is assumed that there exists an unknown $mathbb{R}^d$-valued stationary $varphi$-mixing sequence of parameters $(theta_t,~t in mathbb{N})$ which gives rise to pay-offs. This instance of the problem can be viewed as a generalization of both the classical linear bandits with iid noise, and the finite-armed restless bandits. In light of the well-known computational hardness of optimal policies for restless bandits, an approximation is proposed whose error is shown to be controlled by the $varphi$-dependence between consecutive $theta_t$. An optimistic algorithm, called LinMix-UCB, is proposed for the case where $theta_t$ has an exponential mixing rate. The proposed algorithm is shown to incur a sub-linear regret of $mathcal{O}left(sqrt{d nmathrm{polylog}(n) }right)$ with respect to an oracle that always plays a multiple of $mathbb{E}theta_t$. The main challenge in this setting is to ensure that the exploration-exploitation strategy is robust against long-range dependencies. The proposed method relies on Berbee's coupling lemma to carefully select near-independent samples and construct confidence ellipsoids around empirical estimates of $mathbb{E}theta_t$.

5/20/2024

stat.ML cs.IT cs.LG

🔍

Low-Complexity Algorithm for Restless Bandits with Imperfect Observations

Keqin Liu, Richard Weber, Chengzhong Zhang

We consider a class of restless bandit problems that finds a broad application area in reinforcement learning and stochastic optimization. We consider $N$ independent discrete-time Markov processes, each of which had two possible states: 1 and 0 (`good' and `bad'). Only if a process is both in state 1 and observed to be so does reward accrue. The aim is to maximize the expected discounted sum of returns over the infinite horizon subject to a constraint that only $M$ $(<N)$ processes may be observed at each step. Observation is error-prone: there are known probabilities that state 1 (0) will be observed as 0 (1). From this one knows, at any time $t$, a probability that process $i$ is in state 1. The resulting system may be modeled as a restless multi-armed bandit problem with an information state space of uncountable cardinality. Restless bandit problems with even finite state spaces are PSPACE-HARD in general. We propose a novel approach for simplifying the dynamic programming equations of this class of restless bandits and develop a low-complexity algorithm that achieves a strong performance and is readily extensible to the general restless bandit model with observation errors. Under certain conditions, we establish the existence (indexability) of Whittle index and its equivalence to our algorithm. When those conditions do not hold, we show by numerical experiments the near-optimal performance of our algorithm in the general parametric space. Furthermore, we theoretically prove the optimality of our algorithm for homogeneous systems.

5/14/2024

cs.LG

🏅

Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback

Guojun Xiong, Jian Li

Restless multi-armed bandits (RMAB) play a central role in modeling sequential decision making problems under an instantaneous activation constraint that at most B arms can be activated at any decision epoch. Each restless arm is endowed with a state that evolves independently according to a Markov decision process regardless of being activated or not. In this paper, we consider the task of learning in episodic RMAB with unknown transition functions and adversarial rewards, which can change arbitrarily across episodes. Further, we consider a challenging but natural bandit feedback setting that only adversarial rewards of activated arms are revealed to the decision maker (DM). The goal of the DM is to maximize its total adversarial rewards during the learning process while the instantaneous activation constraint must be satisfied in each decision epoch. We develop a novel reinforcement learning algorithm with two key contributors: a novel biased adversarial reward estimator to deal with bandit feedback and unknown transitions, and a low-complexity index policy to satisfy the instantaneous activation constraint. We show $tilde{mathcal{O}}(Hsqrt{T})$ regret bound for our algorithm, where $T$ is the number of episodes and $H$ is the episode length. To our best knowledge, this is the first algorithm to ensure $tilde{mathcal{O}}(sqrt{T})$ regret for adversarial RMAB in our considered challenging settings.

5/3/2024

cs.LG cs.AI stat.ML

🛠️

Optimal Best Arm Identification with Fixed Confidence in Restless Bandits

P. N. Karthik, Vincent Y. F. Tan, Arpan Mukherjee, Ali Tajer

We study best arm identification in a restless multi-armed bandit setting with finitely many arms. The discrete-time data generated by each arm forms a homogeneous Markov chain taking values in a common, finite state space. The state transitions in each arm are captured by an ergodic transition probability matrix (TPM) that is a member of a single-parameter exponential family of TPMs. The real-valued parameters of the arm TPMs are unknown and belong to a given space. Given a function $f$ defined on the common state space of the arms, the goal is to identify the best arm -- the arm with the largest average value of $f$ evaluated under the arm's stationary distribution -- with the fewest number of samples, subject to an upper bound on the decision's error probability (i.e., the fixed-confidence regime). A lower bound on the growth rate of the expected stopping time is established in the asymptote of a vanishing error probability. Furthermore, a policy for best arm identification is proposed, and its expected stopping time is proved to have an asymptotic growth rate that matches the lower bound. It is demonstrated that tracking the long-term behavior of a certain Markov decision process and its state-action visitation proportions are the key ingredients in analyzing the converse and achievability bounds. It is shown that under every policy, the state-action visitation proportions satisfy a specific approximate flow conservation constraint and that these proportions match the optimal proportions dictated by the lower bound under any asymptotically optimal policy. The prior studies on best arm identification in restless bandits focus on independent observations from the arms, rested Markov arms, and restless Markov arms with known arm TPMs. In contrast, this work is the first to study best arm identification in restless bandits with unknown arm TPMs.

6/26/2024

stat.ML cs.IT cs.LG