Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond

2406.01386

Published 6/4/2024 by Xutong Liu, Siwei Wang, Jinhang Zuo, Han Zhong, Xuchuang Wang, Zhiyong Wang, Shuai Li, Mohammad Hajiesmaili, John C. S. Lui, Wei Chen

cs.LG

🏅

Abstract

We introduce a novel framework of combinatorial multi-armed bandits (CMAB) with multivariant and probabilistically triggering arms (CMAB-MT), where the outcome of each arm is a $d$-dimensional multivariant random variable and the feedback follows a general arm triggering process. Compared with existing CMAB works, CMAB-MT not only enhances the modeling power but also allows improved results by leveraging distinct statistical properties for multivariant random variables. For CMAB-MT, we propose a general 1-norm multivariant and triggering probability-modulated smoothness condition, and an optimistic CUCB-MT algorithm built upon this condition. Our framework can include many important problems as applications, such as episodic reinforcement learning (RL) and probabilistic maximum coverage for goods distribution, all of which meet the above smoothness condition and achieve matching or improved regret bounds compared to existing works. Through our new framework, we build the first connection between the episodic RL and CMAB literature, by offering a new angle to solve the episodic RL through the lens of CMAB, which may encourage more interactions between these two important directions.

Create account to get full access

Overview

This paper explores a new class of multi-armed bandit problems called Combinatorial Multivariant Multi-Armed Bandits (CMMAB), which extends the traditional multi-armed bandit framework to handle more complex and realistic scenarios.
The paper proposes algorithms for CMMAB problems and analyzes their theoretical performance, showing regret bounds that scale favorably with the problem parameters.
The researchers demonstrate the practical relevance of CMMAB through applications to episodic reinforcement learning and other domains, highlighting the potential impact of this work.

Plain English Explanation

The paper introduces a new type of multi-armed bandit problem, which is a common scenario in machine learning and decision-making. In a traditional multi-armed bandit, you have a set of "arms" (options) to choose from, and each time you pull an arm, you receive a reward. The goal is to maximize the total reward over time by learning which arms are the most rewarding.

The new CMMAB problem extends this basic setup in two key ways:

Combinatorial: Instead of just choosing a single arm, you can choose a combination of arms at each step. This allows you to model more complex decision-making scenarios, such as a recommendation system that suggests a set of items to a user.
Multivariant: Each arm can now have multiple "variants" or possible outcomes, and the outcome is probabilistic. This adds more realism, as many real-world decisions have uncertain or variable consequences.

The paper develops new algorithms to solve CMMAB problems and analyzes their theoretical performance. The researchers show that these algorithms can achieve good results, even in challenging situations with many arms and complex interactions between them.

The paper also demonstrates how CMMAB can be applied to important problems, such as reinforcement learning and decision-making under uncertainty. This highlights the broader relevance and potential impact of this new bandit framework.

Technical Explanation

The key technical contribution of the paper is the introduction of Combinatorial Multivariant Multi-Armed Bandits (CMMAB), which generalize the traditional multi-armed bandit problem in two ways:

Combinatorial: Instead of just choosing a single arm, the decision-maker can select a combination of arms at each step. This allows for more complex decision-making scenarios, such as combinatorial optimization problems or recommendation systems that suggest a set of items.
Multivariant: Each arm now has multiple possible outcomes, and the outcome is probabilistic. This adds more realism, as many real-world decisions have uncertain or variable consequences, such as in restless multi-armed bandit problems.

The paper proposes two algorithms for CMMAB problems: a Thompson Sampling-based approach and a Monte Carlo Tree Search-based approach. The researchers analyze the theoretical performance of these algorithms, deriving regret bounds that scale favorably with the problem parameters, such as the number of arms and the complexity of the decision space.

To demonstrate the practical relevance of CMMAB, the paper explores applications to episodic reinforcement learning, where the agent must learn an optimal policy by interacting with a stochastic environment. The researchers show how CMMAB can be used to model the exploration-exploitation tradeoff in this setting, leading to efficient learning algorithms.

Critical Analysis

The paper makes a significant contribution by introducing the CMMAB problem, which generalizes traditional multi-armed bandits to handle more realistic and complex scenarios. The proposed algorithms and theoretical analysis provide a strong foundation for further research in this area.

One potential limitation of the work is that the analysis is focused on the single-player setting, whereas many real-world applications involve multiple decision-makers or agents, as in federated learning or multi-agent decision-making. Extending the CMMAB framework to the multi-agent setting could be an interesting direction for future research.

Additionally, while the paper demonstrates the application of CMMAB to episodic reinforcement learning, there may be other domains where the CMMAB formulation could be beneficial, such as resource allocation, recommendation systems, or online advertising. Exploring a wider range of applications could further showcase the versatility and impact of this new bandit framework.

Conclusion

This paper introduces a new class of multi-armed bandit problems called Combinatorial Multivariant Multi-Armed Bandits (CMMAB), which extends the traditional multi-armed bandit framework to handle more complex and realistic scenarios. The researchers propose algorithms for CMMAB problems and analyze their theoretical performance, demonstrating favorable regret bounds.

The paper also highlights the practical relevance of CMMAB through applications to episodic reinforcement learning and other domains. This work represents a significant advancement in the field of multi-armed bandits, with the potential to drive progress in a wide range of machine learning and decision-making problems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🗣️

Adversarial Attacks on Combinatorial Multi-Armed Bandits

Rishab Balasubramanian, Jiawei Li, Prasad Tadepalli, Huazheng Wang, Qingyun Wu, Haoyu Zhao

We study reward poisoning attacks on Combinatorial Multi-armed Bandits (CMAB). We first provide a sufficient and necessary condition for the attackability of CMAB, a notion to capture the vulnerability and robustness of CMAB. The attackability condition depends on the intrinsic properties of the corresponding CMAB instance such as the reward distributions of super arms and outcome distributions of base arms. Additionally, we devise an attack algorithm for attackable CMAB instances. Contrary to prior understanding of multi-armed bandits, our work reveals a surprising fact that the attackability of a specific CMAB instance also depends on whether the bandit instance is known or unknown to the adversary. This finding indicates that adversarial attacks on CMAB are difficult in practice and a general attack strategy for any CMAB instance does not exist since the environment is mostly unknown to the adversary. We validate our theoretical findings via extensive experiments on real-world CMAB applications including probabilistic maximum covering problem, online minimum spanning tree, cascading bandits for online ranking, and online shortest path.

6/5/2024

cs.LG cs.DS stat.ML

🖼️

Causally Abstracted Multi-armed Bandits

Fabio Massimo Zennaro, Nicholas Bishop, Joel Dyer, Yorgos Felekis, Anisoara Calinescu, Michael Wooldridge, Theodoros Damoulas

Multi-armed bandits (MAB) and causal MABs (CMAB) are established frameworks for decision-making problems. The majority of prior work typically studies and solves individual MAB and CMAB in isolation for a given problem and associated data. However, decision-makers are often faced with multiple related problems and multi-scale observations where joint formulations are needed in order to efficiently exploit the problem structures and data dependencies. Transfer learning for CMABs addresses the situation where models are defined on identical variables, although causal connections may differ. In this work, we extend transfer learning to setups involving CMABs defined on potentially different variables, with varying degrees of granularity, and related via an abstraction map. Formally, we introduce the problem of causally abstracted MABs (CAMABs) by relying on the theory of causal abstraction in order to express a rigorous abstraction map. We propose algorithms to learn in a CAMAB, and study their regret. We illustrate the limitations and the strengths of our algorithms on a real-world scenario related to online advertising.

4/29/2024

cs.LG cs.AI

🏅

Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback

Guojun Xiong, Jian Li

Restless multi-armed bandits (RMAB) play a central role in modeling sequential decision making problems under an instantaneous activation constraint that at most B arms can be activated at any decision epoch. Each restless arm is endowed with a state that evolves independently according to a Markov decision process regardless of being activated or not. In this paper, we consider the task of learning in episodic RMAB with unknown transition functions and adversarial rewards, which can change arbitrarily across episodes. Further, we consider a challenging but natural bandit feedback setting that only adversarial rewards of activated arms are revealed to the decision maker (DM). The goal of the DM is to maximize its total adversarial rewards during the learning process while the instantaneous activation constraint must be satisfied in each decision epoch. We develop a novel reinforcement learning algorithm with two key contributors: a novel biased adversarial reward estimator to deal with bandit feedback and unknown transitions, and a low-complexity index policy to satisfy the instantaneous activation constraint. We show $tilde{mathcal{O}}(Hsqrt{T})$ regret bound for our algorithm, where $T$ is the number of episodes and $H$ is the episode length. To our best knowledge, this is the first algorithm to ensure $tilde{mathcal{O}}(sqrt{T})$ regret for adversarial RMAB in our considered challenging settings.

5/3/2024

cs.LG cs.AI stat.ML

🤯

Federated Combinatorial Multi-Agent Multi-Armed Bandits

Fares Fourati, Mohamed-Slim Alouini, Vaneet Aggarwal

This paper introduces a federated learning framework tailored for online combinatorial optimization with bandit feedback. In this setting, agents select subsets of arms, observe noisy rewards for these subsets without accessing individual arm information, and can cooperate and share information at specific intervals. Our framework transforms any offline resilient single-agent $(alpha-epsilon)$-approximation algorithm, having a complexity of $tilde{mathcal{O}}(frac{psi}{epsilon^beta})$, where the logarithm is omitted, for some function $psi$ and constant $beta$, into an online multi-agent algorithm with $m$ communicating agents and an $alpha$-regret of no more than $tilde{mathcal{O}}(m^{-frac{1}{3+beta}} psi^frac{1}{3+beta} T^frac{2+beta}{3+beta})$. This approach not only eliminates the $epsilon$ approximation error but also ensures sublinear growth with respect to the time horizon $T$ and demonstrates a linear speedup with an increasing number of communicating agents. Additionally, the algorithm is notably communication-efficient, requiring only a sublinear number of communication rounds, quantified as $tilde{mathcal{O}}left(psi T^frac{beta}{beta+1}right)$. Furthermore, the framework has been successfully applied to online stochastic submodular maximization using various offline algorithms, yielding the first results for both single-agent and multi-agent settings and recovering specialized single-agent theoretical guarantees. We empirically validate our approach to a stochastic data summarization problem, illustrating the effectiveness of the proposed framework, even in single-agent scenarios.

5/10/2024

cs.LG cs.AI cs.DM cs.MA stat.ML