Multi-agent Multi-armed Bandits with Stochastic Sharable Arm Capacities

Read original: arXiv:2408.10865 - Published 8/21/2024 by Hong Xie, Jinyu Mo, Defu Lian, Jie Wang, Enhong Chen

Multi-agent Multi-armed Bandits with Stochastic Sharable Arm Capacities

Overview

This paper proposes a new multi-agent multi-armed bandit problem with stochastic sharable arm capacities.
The goal is to maximize the total reward across multiple agents by optimally allocating limited shared resources.
The paper introduces an efficient algorithm called TSAC that provably achieves near-optimal performance.

Plain English Explanation

In this research, the authors look at a scenario where multiple agents are trying to earn rewards by pulling different "arms" of a multi-armed bandit machine. However, the capacity of each arm is limited and can be shared across the agents.

This means that if multiple agents want to pull the same arm, there may not be enough "room" for all of them to do so. The agents need to strategize and coordinate how they use these limited shared resources to maximize their collective reward.

The paper proposes a new algorithm called TSAC (Thompson Sampling for Arm Capacities) that helps the agents efficiently navigate this challenge. TSAC allows the agents to learn about the arm capacities over time and make smart decisions about how to allocate the limited resources. The authors prove that TSAC can achieve near-optimal performance, meaning it gets very close to the best possible reward that could be earned in this scenario.

Technical Explanation

The key elements of the paper are:

Problem Formulation: The authors introduce the Multi-agent Multi-armed Bandits with Stochastic Sharable Arm Capacities problem, which models a setting where multiple agents compete for the limited capacity of each arm in a multi-armed bandit.
TSAC Algorithm: They propose the TSAC (Thompson Sampling for Arm Capacities) algorithm, which uses Thompson Sampling to learn the arm capacities and make allocation decisions that balance exploration and exploitation.
Theoretical Analysis: The authors prove that TSAC achieves a near-optimal regret bound, meaning it performs almost as well as the best possible algorithm for this problem.
Experiments: The paper includes simulations demonstrating the superior performance of TSAC compared to baseline approaches on synthetic and real-world datasets.

Critical Analysis

The paper provides a robust theoretical analysis and practical algorithm for the multi-agent multi-armed bandit problem with limited shared resources. However, the authors acknowledge that their model assumes the arm capacities are independent and stationary, which may not always hold in real-world applications.

Additionally, the paper does not explore the impact of communication and coordination between the agents, which could significantly affect the performance of the algorithm. Further research could investigate more dynamic and realistic scenarios, such as federated learning settings or hierarchical multi-armed bandits.

Conclusion

This paper presents a novel formulation of the multi-agent multi-armed bandit problem with limited shared resources and an efficient algorithm, TSAC, that can achieve near-optimal performance. The theoretical analysis and experimental results demonstrate the potential of this approach for applications where multiple agents need to coordinate their actions to maximize collective reward. While the model has some simplifying assumptions, the work lays a strong foundation for further research in this important area of multi-agent decision-making.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Multi-agent Multi-armed Bandits with Stochastic Sharable Arm Capacities

Hong Xie, Jinyu Mo, Defu Lian, Jie Wang, Enhong Chen

Motivated by distributed selection problems, we formulate a new variant of multi-player multi-armed bandit (MAB) model, which captures stochastic arrival of requests to each arm, as well as the policy of allocating requests to players. The challenge is how to design a distributed learning algorithm such that players select arms according to the optimal arm pulling profile (an arm pulling profile prescribes the number of players at each arm) without communicating to each other. We first design a greedy algorithm, which locates one of the optimal arm pulling profiles with a polynomial computational complexity. We also design an iterative distributed algorithm for players to commit to an optimal arm pulling profile with a constant number of rounds in expectation. We apply the explore then commit (ETC) framework to address the online setting when model parameters are unknown. We design an exploration strategy for players to estimate the optimal arm pulling profile. Since such estimates can be different across different players, it is challenging for players to commit. We then design an iterative distributed algorithm, which guarantees that players can arrive at a consensus on the optimal arm pulling profile in only M rounds. We conduct experiments to validate our algorithm.

8/21/2024

🏅

Collaborative Multi-Agent Heterogeneous Multi-Armed Bandits

Ronshee Chawla, Daniel Vial, Sanjay Shakkottai, R. Srikant

The study of collaborative multi-agent bandits has attracted significant attention recently. In light of this, we initiate the study of a new collaborative setting, consisting of $N$ agents such that each agent is learning one of $M$ stochastic multi-armed bandits to minimize their group cumulative regret. We develop decentralized algorithms which facilitate collaboration between the agents under two scenarios. We characterize the performance of these algorithms by deriving the per agent cumulative regret and group regret upper bounds. We also prove lower bounds for the group regret in this setting, which demonstrates the near-optimal behavior of the proposed algorithms.

7/4/2024

A Federated Online Restless Bandit Framework for Cooperative Resource Allocation

Jingwen Tong, Xinran Li, Liqun Fu, Jun Zhang, Khaled B. Letaief

Restless multi-armed bandits (RMABs) have been widely utilized to address resource allocation problems with Markov reward processes (MRPs). Existing works often assume that the dynamics of MRPs are known prior, which makes the RMAB problem solvable from an optimization perspective. Nevertheless, an efficient learning-based solution for RMABs with unknown system dynamics remains an open problem. In this paper, we study the cooperative resource allocation problem with unknown system dynamics of MRPs. This problem can be modeled as a multi-agent online RMAB problem, where multiple agents collaboratively learn the system dynamics while maximizing their accumulated rewards. We devise a federated online RMAB framework to mitigate the communication overhead and data privacy issue by adopting the federated learning paradigm. Based on this framework, we put forth a Federated Thompson Sampling-enabled Whittle Index (FedTSWI) algorithm to solve this multi-agent online RMAB problem. The FedTSWI algorithm enjoys a high communication and computation efficiency, and a privacy guarantee. Moreover, we derive a regret upper bound for the FedTSWI algorithm. Finally, we demonstrate the effectiveness of the proposed algorithm on the case of online multi-user multi-channel access. Numerical results show that the proposed algorithm achieves a fast convergence rate of $mathcal{O}(sqrt{Tlog(T)})$ and better performance compared with baselines. More importantly, its sample complexity decreases with the number of agents.

6/13/2024

🤯

Federated Combinatorial Multi-Agent Multi-Armed Bandits

Fares Fourati, Mohamed-Slim Alouini, Vaneet Aggarwal

This paper introduces a federated learning framework tailored for online combinatorial optimization with bandit feedback. In this setting, agents select subsets of arms, observe noisy rewards for these subsets without accessing individual arm information, and can cooperate and share information at specific intervals. Our framework transforms any offline resilient single-agent $(alpha-epsilon)$-approximation algorithm, having a complexity of $tilde{mathcal{O}}(frac{psi}{epsilon^beta})$, where the logarithm is omitted, for some function $psi$ and constant $beta$, into an online multi-agent algorithm with $m$ communicating agents and an $alpha$-regret of no more than $tilde{mathcal{O}}(m^{-frac{1}{3+beta}} psi^frac{1}{3+beta} T^frac{2+beta}{3+beta})$. This approach not only eliminates the $epsilon$ approximation error but also ensures sublinear growth with respect to the time horizon $T$ and demonstrates a linear speedup with an increasing number of communicating agents. Additionally, the algorithm is notably communication-efficient, requiring only a sublinear number of communication rounds, quantified as $tilde{mathcal{O}}left(psi T^frac{beta}{beta+1}right)$. Furthermore, the framework has been successfully applied to online stochastic submodular maximization using various offline algorithms, yielding the first results for both single-agent and multi-agent settings and recovering specialized single-agent theoretical guarantees. We empirically validate our approach to a stochastic data summarization problem, illustrating the effectiveness of the proposed framework, even in single-agent scenarios.

5/10/2024