Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback

2405.00950

Published 5/3/2024 by Guojun Xiong, Jian Li

🏅

Abstract

Restless multi-armed bandits (RMAB) play a central role in modeling sequential decision making problems under an instantaneous activation constraint that at most B arms can be activated at any decision epoch. Each restless arm is endowed with a state that evolves independently according to a Markov decision process regardless of being activated or not. In this paper, we consider the task of learning in episodic RMAB with unknown transition functions and adversarial rewards, which can change arbitrarily across episodes. Further, we consider a challenging but natural bandit feedback setting that only adversarial rewards of activated arms are revealed to the decision maker (DM). The goal of the DM is to maximize its total adversarial rewards during the learning process while the instantaneous activation constraint must be satisfied in each decision epoch. We develop a novel reinforcement learning algorithm with two key contributors: a novel biased adversarial reward estimator to deal with bandit feedback and unknown transitions, and a low-complexity index policy to satisfy the instantaneous activation constraint. We show $tilde{mathcal{O}}(Hsqrt{T})$ regret bound for our algorithm, where $T$ is the number of episodes and $H$ is the episode length. To our best knowledge, this is the first algorithm to ensure $tilde{mathcal{O}}(sqrt{T})$ regret for adversarial RMAB in our considered challenging settings.

Create account to get full access

Overview

This paper proposes a novel reinforcement learning algorithm for solving the Restless Multi-Armed Bandit (RMAB) problem in an episodic setting with unknown transition functions and adversarial rewards.
The key challenges addressed include satisfying an instantaneous activation constraint, where at most B arms can be activated at any decision epoch, and dealing with bandit feedback where only the rewards of activated arms are revealed.
The algorithm developed in this paper achieves a regret bound of $\tilde{\mathcal{O}}(H\sqrt{T})$, where T is the number of episodes and H is the episode length, which is the first algorithm to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret for adversarial RMAB in this challenging setting.

Plain English Explanation

The paper focuses on a problem called the Restless Multi-Armed Bandit (RMAB), which is a type of sequential decision-making problem. In this problem, there are multiple "arms" (options) that can be "pulled" (activated) at each decision point, but there is a constraint that at most B arms can be activated at any given time.

Each arm has a hidden state that evolves over time, even when the arm is not activated. The goal is to learn which arms to activate at each decision point in order to maximize the total rewards obtained over time, even when the rewards are adversarial (can change arbitrarily from one episode to the next) and the underlying transition functions of the arms are unknown.

This is a challenging problem because the decision-maker only gets to see the rewards of the arms that were actually activated, not the rewards of all the arms (known as "bandit feedback"). The paper proposes a novel algorithm that uses a biased reward estimator and a low-complexity index policy to tackle this problem, and it proves that the algorithm achieves a $\tilde{\mathcal{O}}(H\sqrt{T})$ regret bound, which is the best known result for this setting.

Technical Explanation

The paper considers the problem of learning in episodic Restless Multi-Armed Bandits (RMAB) with unknown transition functions and adversarial rewards. In this setting, there are multiple "arms" (options), each with a state that evolves over time according to a Markov decision process, regardless of whether the arm is activated or not. The decision-maker (DM) must activate at most B arms at each decision epoch, while trying to maximize the total adversarial rewards obtained during the learning process.

The key challenges addressed in this paper are:

Satisfying the instantaneous activation constraint of at most B arms per decision epoch.
Dealing with bandit feedback, where only the rewards of activated arms are revealed to the DM.
Learning in the presence of unknown transition functions and adversarial rewards.

The authors develop a novel reinforcement learning algorithm that has two main components:

A biased adversarial reward estimator to handle the bandit feedback and unknown transitions.
A low-complexity index policy to satisfy the instantaneous activation constraint.

The authors prove that their algorithm achieves a $\tilde{\mathcal{O}}(H\sqrt{T})$ regret bound, where T is the number of episodes and H is the episode length. This is the first algorithm to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret for adversarial RMAB in this challenging setting, which includes multi-agent bandit learning, robust multi-agent reinforcement learning, and contextual dueling bandits as special cases.

Critical Analysis

The paper presents a well-designed algorithm and a thorough theoretical analysis to address the challenging RMAB problem with unknown transitions and adversarial rewards. The regret bound achieved by the proposed algorithm is the best known for this setting, which is a significant contribution to the field.

However, the paper does not discuss the practical implementation and scalability of the algorithm. The computational complexity of the index policy and the reward estimation procedure may be a concern for large-scale problems. Additionally, the paper does not provide any empirical evaluation or comparison with existing approaches, which would have been helpful to understand the algorithm's performance in real-world scenarios.

Furthermore, the paper assumes that the arms' transition functions are completely unknown, which may not always be the case in practical applications. It would be interesting to see how the algorithm's performance might change if some partial information about the transition functions is available.

Finally, the paper does not address the exploration-exploitation tradeoff in the RMAB setting, which is a critical aspect of reinforcement learning problems. Investigating the interplay between the instantaneous activation constraint and the exploration-exploitation tradeoff could be a promising direction for future research.

Conclusion

This paper presents a novel reinforcement learning algorithm for solving the Restless Multi-Armed Bandit (RMAB) problem in an episodic setting with unknown transition functions and adversarial rewards. The key contributions of the paper are the development of a biased adversarial reward estimator and a low-complexity index policy to address the challenges of bandit feedback and the instantaneous activation constraint, respectively.

The theoretical analysis shows that the proposed algorithm achieves a regret bound of $\tilde{\mathcal{O}}(H\sqrt{T})$, which is the best known result for this challenging setting. This work has the potential to impact various applications that can be modeled as RMAB problems, such as sequential Monte Carlo bandits, multi-agent reinforcement learning, and contextual dueling bandits. Future research could focus on improving the practical aspects of the algorithm, exploring partial information about the transition functions, and investigating the exploration-exploitation tradeoff in the RMAB setting.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Global Rewards in Restless Multi-Armed Bandits

Naveen Raman, Zheyuan Ryan Shi, Fei Fang

Restless multi-armed bandits (RMAB) extend multi-armed bandits so pulling an arm impacts future states. Despite the success of RMABs, a key limiting assumption is the separability of rewards into a sum across arms. We address this deficiency by proposing restless-multi-armed bandit with global rewards (RMAB-G), a generalization of RMABs to global non-separable rewards. To solve RMAB-G, we develop the Linear- and Shapley-Whittle indices, which extend Whittle indices from RMABs to RMAB-Gs. We prove approximation bounds but also point out how these indices could fail when reward functions are highly non-linear. To overcome this, we propose two sets of adaptive policies: the first computes indices iteratively, and the second combines indices with Monte-Carlo Tree Search (MCTS). Empirically, we demonstrate that our proposed policies outperform baselines and index-based policies with synthetic data and real-world data from food rescue.

6/11/2024

cs.LG cs.AI cs.CY

A Federated Online Restless Bandit Framework for Cooperative Resource Allocation

Jingwen Tong, Xinran Li, Liqun Fu, Jun Zhang, Khaled B. Letaief

Restless multi-armed bandits (RMABs) have been widely utilized to address resource allocation problems with Markov reward processes (MRPs). Existing works often assume that the dynamics of MRPs are known prior, which makes the RMAB problem solvable from an optimization perspective. Nevertheless, an efficient learning-based solution for RMABs with unknown system dynamics remains an open problem. In this paper, we study the cooperative resource allocation problem with unknown system dynamics of MRPs. This problem can be modeled as a multi-agent online RMAB problem, where multiple agents collaboratively learn the system dynamics while maximizing their accumulated rewards. We devise a federated online RMAB framework to mitigate the communication overhead and data privacy issue by adopting the federated learning paradigm. Based on this framework, we put forth a Federated Thompson Sampling-enabled Whittle Index (FedTSWI) algorithm to solve this multi-agent online RMAB problem. The FedTSWI algorithm enjoys a high communication and computation efficiency, and a privacy guarantee. Moreover, we derive a regret upper bound for the FedTSWI algorithm. Finally, we demonstrate the effectiveness of the proposed algorithm on the case of online multi-user multi-channel access. Numerical results show that the proposed algorithm achieves a fast convergence rate of $mathcal{O}(sqrt{Tlog(T)})$ and better performance compared with baselines. More importantly, its sample complexity decreases with the number of agents.

6/13/2024

cs.LG eess.SP

🏅

Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond

Xutong Liu, Siwei Wang, Jinhang Zuo, Han Zhong, Xuchuang Wang, Zhiyong Wang, Shuai Li, Mohammad Hajiesmaili, John C. S. Lui, Wei Chen

We introduce a novel framework of combinatorial multi-armed bandits (CMAB) with multivariant and probabilistically triggering arms (CMAB-MT), where the outcome of each arm is a $d$-dimensional multivariant random variable and the feedback follows a general arm triggering process. Compared with existing CMAB works, CMAB-MT not only enhances the modeling power but also allows improved results by leveraging distinct statistical properties for multivariant random variables. For CMAB-MT, we propose a general 1-norm multivariant and triggering probability-modulated smoothness condition, and an optimistic CUCB-MT algorithm built upon this condition. Our framework can include many important problems as applications, such as episodic reinforcement learning (RL) and probabilistic maximum coverage for goods distribution, all of which meet the above smoothness condition and achieve matching or improved regret bounds compared to existing works. Through our new framework, we build the first connection between the episodic RL and CMAB literature, by offering a new angle to solve the episodic RL through the lens of CMAB, which may encourage more interactions between these two important directions.

6/4/2024

cs.LG

🗣️

Adversarial Attacks on Combinatorial Multi-Armed Bandits

Rishab Balasubramanian, Jiawei Li, Prasad Tadepalli, Huazheng Wang, Qingyun Wu, Haoyu Zhao

We study reward poisoning attacks on Combinatorial Multi-armed Bandits (CMAB). We first provide a sufficient and necessary condition for the attackability of CMAB, a notion to capture the vulnerability and robustness of CMAB. The attackability condition depends on the intrinsic properties of the corresponding CMAB instance such as the reward distributions of super arms and outcome distributions of base arms. Additionally, we devise an attack algorithm for attackable CMAB instances. Contrary to prior understanding of multi-armed bandits, our work reveals a surprising fact that the attackability of a specific CMAB instance also depends on whether the bandit instance is known or unknown to the adversary. This finding indicates that adversarial attacks on CMAB are difficult in practice and a general attack strategy for any CMAB instance does not exist since the environment is mostly unknown to the adversary. We validate our theoretical findings via extensive experiments on real-world CMAB applications including probabilistic maximum covering problem, online minimum spanning tree, cascading bandits for online ranking, and online shortest path.

6/5/2024

cs.LG cs.DS stat.ML