Multi-Armed Bandits with Network Interference

2405.18621

Published 5/30/2024 by Abhineet Agarwal, Anish Agarwal, Lorenzo Masoero, Justin Whitehouse

Multi-Armed Bandits with Network Interference

Abstract

Online experimentation with interference is a common challenge in modern applications such as e-commerce and adaptive clinical trials in medicine. For example, in online marketplaces, the revenue of a good depends on discounts applied to competing goods. Statistical inference with interference is widely studied in the offline setting, but far less is known about how to adaptively assign treatments to minimize regret. We address this gap by studying a multi-armed bandit (MAB) problem where a learner (e-commerce platform) sequentially assigns one of possible $mathcal{A}$ actions (discounts) to $N$ units (goods) over $T$ rounds to minimize regret (maximize revenue). Unlike traditional MAB problems, the reward of each unit depends on the treatments assigned to other units, i.e., there is interference across the underlying network of units. With $mathcal{A}$ actions and $N$ units, minimizing regret is combinatorially difficult since the action space grows as $mathcal{A}^N$. To overcome this issue, we study a sparse network interference model, where the reward of a unit is only affected by the treatments assigned to $s$ neighboring units. We use tools from discrete Fourier analysis to develop a sparse linear representation of the unit-specific reward $r_n: [mathcal{A}]^N rightarrow mathbb{R} $, and propose simple, linear regression-based algorithms to minimize regret. Importantly, our algorithms achieve provably low regret both when the learner observes the interference neighborhood for all units and when it is unknown. This significantly generalizes other works on this topic which impose strict conditions on the strength of interference on a known network, and also compare regret to a markedly weaker optimal action. Empirically, we corroborate our theoretical findings via numerical simulations.

Create account to get full access

Overview

Explores a multi-armed bandit problem with network interference, where the reward of one arm can depend on the actions of neighboring arms
Proposes a novel algorithm, called Network-UCB, that uses information about the network structure to improve performance
Provides theoretical guarantees on the algorithm's regret, demonstrating its effectiveness compared to standard multi-armed bandit algorithms

Plain English Explanation

In the multi-armed bandit problem, an agent has to choose from a set of "arms" (options) and receive a reward, with the goal of maximizing the total reward over time. However, in this case, the reward of an arm can depend on the actions of other, neighboring arms due to a network structure.

The researchers developed a new algorithm called Network-UCB that takes advantage of this network information to make better decisions. Instead of just focusing on the rewards of each arm independently, Network-UCB considers how the arms are connected and the potential influence they have on each other. This allows the algorithm to explore the arms more effectively and converge to the optimal set of arms faster than standard multi-armed bandit algorithms.

The team also provided theoretical guarantees, showing that Network-UCB achieves lower regret - the difference between the total reward of the optimal strategy and the total reward of the algorithm - compared to other approaches. This means Network-UCB can learn the best set of arms to pull more efficiently.

Technical Explanation

The paper studies a multi-armed bandit problem with network interference, where the reward of an arm can depend on the actions of neighboring arms in a network. The authors propose a novel algorithm called Network-UCB that leverages the network structure to make more informed decisions.

Network-UCB works by maintaining a confidence bound for each arm, similar to the standard Upper Confidence Bound (UCB) algorithm. However, instead of just considering the rewards of each arm independently, Network-UCB also takes into account the potential influence of neighboring arms based on the network structure. This allows the algorithm to explore the arms more efficiently and converge to the optimal set of arms faster.

The authors provide theoretical guarantees on the regret of Network-UCB, showing that it achieves a better regret bound compared to standard multi-armed bandit algorithms, even in the presence of network interference. This is an important result, as it demonstrates the effectiveness of incorporating network information into the decision-making process.

Critical Analysis

The paper presents a well-designed and theoretically sound approach to the multi-armed bandit problem with network interference. The authors clearly identify the limitations of standard multi-armed bandit algorithms in this setting and propose a novel solution that takes advantage of the network structure.

One potential limitation of the research is that it assumes the network structure is known in advance. In practice, the network may not be fully observable, and the algorithm would need to learn the network structure simultaneously. The authors acknowledge this and suggest it as a direction for future research.

Additionally, the paper focuses on the theoretical analysis and does not provide extensive experimental results. While the theoretical guarantees are compelling, it would be helpful to see how Network-UCB performs in realistic scenarios and how it compares to other state-of-the-art algorithms for multi-armed bandits with network interference.

Conclusion

This paper makes a significant contribution to the field of multi-armed bandit learning by proposing a novel algorithm, Network-UCB, that can effectively handle the challenge of network interference. The theoretical analysis demonstrates the algorithm's superior performance, and the insights gained from this research could inform the development of more advanced multi-agent decision-making systems. Future work could explore learning the network structure and evaluating the algorithm's practical performance in real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Multi-Agent Bandit Learning through Heterogeneous Action Erasure Channels

Osama A. Hanna, Merve Karakas, Lin F. Yang, Christina Fragouli

Multi-Armed Bandit (MAB) systems are witnessing an upswing in applications within multi-agent distributed environments, leading to the advancement of collaborative MAB algorithms. In such settings, communication between agents executing actions and the primary learner making decisions can hinder the learning process. A prevalent challenge in distributed learning is action erasure, often induced by communication delays and/or channel noise. This results in agents possibly not receiving the intended action from the learner, subsequently leading to misguided feedback. In this paper, we introduce novel algorithms that enable learners to interact concurrently with distributed agents across heterogeneous action erasure channels with different action erasure probabilities. We illustrate that, in contrast to existing bandit algorithms, which experience linear regret, our algorithms assure sub-linear regret guarantees. Our proposed solutions are founded on a meticulously crafted repetition protocol and scheduling of learning across heterogeneous channels. To our knowledge, these are the first algorithms capable of effectively learning through heterogeneous action erasure channels. We substantiate the superior performance of our algorithm through numerical experiments, emphasizing their practical significance in addressing issues related to communication constraints and delays in multi-agent environments.

4/30/2024

cs.LG cs.DC cs.MA

🔍

Learning for Bandits under Action Erasures

Osama Hanna, Merve Karakas, Lin F. Yang, Christina Fragouli

We consider a novel multi-arm bandit (MAB) setup, where a learner needs to communicate the actions to distributed agents over erasure channels, while the rewards for the actions are directly available to the learner through external sensors. In our model, while the distributed agents know if an action is erased, the central learner does not (there is no feedback), and thus does not know whether the observed reward resulted from the desired action or not. We propose a scheme that can work on top of any (existing or future) MAB algorithm and make it robust to action erasures. Our scheme results in a worst-case regret over action-erasure channels that is at most a factor of $O(1/sqrt{1-epsilon})$ away from the no-erasure worst-case regret of the underlying MAB algorithm, where $epsilon$ is the erasure probability. We also propose a modification of the successive arm elimination algorithm and prove that its worst-case regret is $Tilde{O}(sqrt{KT}+K/(1-epsilon))$, which we prove is optimal by providing a matching lower bound.

6/27/2024

stat.ML cs.LG

A Federated Online Restless Bandit Framework for Cooperative Resource Allocation

Jingwen Tong, Xinran Li, Liqun Fu, Jun Zhang, Khaled B. Letaief

Restless multi-armed bandits (RMABs) have been widely utilized to address resource allocation problems with Markov reward processes (MRPs). Existing works often assume that the dynamics of MRPs are known prior, which makes the RMAB problem solvable from an optimization perspective. Nevertheless, an efficient learning-based solution for RMABs with unknown system dynamics remains an open problem. In this paper, we study the cooperative resource allocation problem with unknown system dynamics of MRPs. This problem can be modeled as a multi-agent online RMAB problem, where multiple agents collaboratively learn the system dynamics while maximizing their accumulated rewards. We devise a federated online RMAB framework to mitigate the communication overhead and data privacy issue by adopting the federated learning paradigm. Based on this framework, we put forth a Federated Thompson Sampling-enabled Whittle Index (FedTSWI) algorithm to solve this multi-agent online RMAB problem. The FedTSWI algorithm enjoys a high communication and computation efficiency, and a privacy guarantee. Moreover, we derive a regret upper bound for the FedTSWI algorithm. Finally, we demonstrate the effectiveness of the proposed algorithm on the case of online multi-user multi-channel access. Numerical results show that the proposed algorithm achieves a fast convergence rate of $mathcal{O}(sqrt{Tlog(T)})$ and better performance compared with baselines. More importantly, its sample complexity decreases with the number of agents.

6/13/2024

cs.LG eess.SP

🗣️

Adversarial Attacks on Combinatorial Multi-Armed Bandits

Rishab Balasubramanian, Jiawei Li, Prasad Tadepalli, Huazheng Wang, Qingyun Wu, Haoyu Zhao

We study reward poisoning attacks on Combinatorial Multi-armed Bandits (CMAB). We first provide a sufficient and necessary condition for the attackability of CMAB, a notion to capture the vulnerability and robustness of CMAB. The attackability condition depends on the intrinsic properties of the corresponding CMAB instance such as the reward distributions of super arms and outcome distributions of base arms. Additionally, we devise an attack algorithm for attackable CMAB instances. Contrary to prior understanding of multi-armed bandits, our work reveals a surprising fact that the attackability of a specific CMAB instance also depends on whether the bandit instance is known or unknown to the adversary. This finding indicates that adversarial attacks on CMAB are difficult in practice and a general attack strategy for any CMAB instance does not exist since the environment is mostly unknown to the adversary. We validate our theoretical findings via extensive experiments on real-world CMAB applications including probabilistic maximum covering problem, online minimum spanning tree, cascading bandits for online ranking, and online shortest path.

6/5/2024

cs.LG cs.DS stat.ML