Delay as Payoff in MAB

Read original: arXiv:2408.15158 - Published 8/28/2024 by Ofir Schlisselberg, Ido Cohen, Tal Lancewicki, Yishay Mansour

✅

Overview

This paper investigates a variant of the classical stochastic Multi-armed Bandit (MAB) problem.
In this setting, the payoff received by an agent (either cost or reward) is both delayed and directly corresponds to the magnitude of the delay.
This models real-world scenarios like the time it takes for a data packet to traverse a network or a user's time spent on a web page.

Plain English Explanation

The paper looks at a specific type of multi-armed bandit problem, which is a common setup in machine learning and optimization. In a standard multi-armed bandit problem, an agent has to choose between several "arms" (options) that each have an unknown payoff. The agent's goal is to maximize their total payoff over time by learning which arms are best.

In this paper, the authors consider a twist on the standard problem - the payoff the agent receives is not immediately known, but is instead delayed and directly related to the amount of delay. So the longer the agent has to wait for their payoff, the higher or lower it will be.

This delay-based payoff structure mirrors real-world scenarios like:

Choosing a network route, where the "payoff" is the time it takes for a data packet to traverse the network
Choosing web content, where the "payoff" is the time a user spends engaging with that content

The key contribution of the paper is deriving tight upper and lower bounds on the regret (difference between the agent's total payoff and the optimal payoff) for both the "delay as cost" and "delay as reward" settings. These regret bounds provide a theoretical understanding of how well algorithms can perform in these delay-based multi-armed bandit problems.

Technical Explanation

The paper analyzes a variant of the classical stochastic multi-armed bandit (MAB) problem where the payoff received by the agent (either cost or reward) is both delayed and directly corresponds to the magnitude of the delay.

For the "delay as cost" setting, the authors prove an optimal regret bound that scales as:

Σ_i:Δ_i > 0 (log T) / Δ_i + d^*

where T is the maximum number of steps, Δ_i are the sub-optimality gaps, and d^* is the minimal expected delay amongst the arms.

For the "delay as reward" setting, the authors show an optimal regret bound of:

Σ_i:Δ_i > 0 (log T) / Δ_i + d̄

where d̄ is the second maximal expected delay.

These regret bounds improve upon the general delay-dependent payoff setting, which scales as:

Σ_i:Δ_i > 0 (log T) / Δ_i + D

where D is the maximum possible delay.

The authors' results highlight the key differences between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than for the reward.

The paper also includes an empirical evaluation to complement the theoretical analysis.

Critical Analysis

The paper makes a strong theoretical contribution by deriving tight upper and lower bounds on the regret for multi-armed bandit problems with delayed and delay-dependent payoffs. This provides a principled understanding of the inherent difficulty of these problems and the performance limits of any algorithm.

One potential limitation is that the analysis assumes the delays are stochastic with known distributions. In practice, the delay distributions may not be known a priori, which could impact the algorithm performance and regret bounds.

Additionally, the paper focuses on the single-agent setting. It would be interesting to explore extensions to multi-agent scenarios, where agents' choices may interfere with each other's payoffs, as in this related work.

Overall, this is a technically solid paper that makes important theoretical advances in the multi-armed bandit literature. The findings could have implications for real-world applications involving delayed or time-dependent payoffs, such as network routing, content recommendation, and online advertising.

Conclusion

This paper investigates a variant of the classical multi-armed bandit problem where the payoff received by an agent is both delayed and directly related to the magnitude of the delay. The authors derive tight upper and lower bounds on the regret for both the "delay as cost" and "delay as reward" settings, improving upon the general delay-dependent payoff case.

These theoretical results provide a principled understanding of the inherent difficulty of these delay-based multi-armed bandit problems and the performance limits of any algorithm. While the analysis assumes known delay distributions, the findings could still have implications for real-world applications involving delayed or time-dependent payoffs, such as network routing, content recommendation, and online advertising.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

✅

Delay as Payoff in MAB

Ofir Schlisselberg, Ido Cohen, Tal Lancewicki, Yishay Mansour

In this paper, we investigate a variant of the classical stochastic Multi-armed Bandit (MAB) problem, where the payoff received by an agent (either cost or reward) is both delayed, and directly corresponds to the magnitude of the delay. This setting models faithfully many real world scenarios such as the time it takes for a data packet to traverse a network given a choice of route (where delay serves as the agent's cost); or a user's time spent on a web page given a choice of content (where delay serves as the agent's reward). Our main contributions are tight upper and lower bounds for both the cost and reward settings. For the case that delays serve as costs, which we are the first to consider, we prove optimal regret that scales as $sum_{i:Delta_i > 0}frac{log T}{Delta_i} + d^*$, where $T$ is the maximal number of steps, $Delta_i$ are the sub-optimality gaps and $d^*$ is the minimal expected delay amongst arms. For the case that delays serves as rewards, we show optimal regret of $sum_{i:Delta_i > 0}frac{log T}{Delta_i} + bar{d}$, where $bar d$ is the second maximal expected delay. These improve over the regret in the general delay-dependent payoff setting, which scales as $sum_{i:Delta_i > 0}frac{log T}{Delta_i} + D$, where $D$ is the maximum possible delay. Our regret bounds highlight the difference between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than for the reward. Finally, we accompany our theoretical results with an empirical evaluation.

8/28/2024

🌀

Blocking Bandits

Soumya Basu, Rajat Sen, Sujay Sanghavi, Sanjay Shakkottai

We consider a novel stochastic multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter. This models situations where reusing an arm too often is undesirable (e.g. making the same product recommendation repeatedly) or infeasible (e.g. compute job scheduling on machines). We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm (in the number of arms) unless randomized exponential time hypothesis is false, by mapping to the PINWHEEL scheduling problem. Subsequently, we show that a simple greedy algorithm that plays the available arm with the highest reward is asymptotically $(1-1/e)$ optimal. When the rewards are unknown, we design a UCB based algorithm which is shown to have $c log T + o(log T)$ cumulative regret against the greedy algorithm, leveraging the free exploration of arms due to the unavailability. Finally, when all the delays are equal the problem reduces to Combinatorial Semi-bandits providing us with a lower bound of $c' log T+ omega(log T)$.

7/31/2024

Multi-Armed Bandits with Interference

Su Jia, Peter Frazier, Nathan Kallus

Experimentation with interference poses a significant challenge in contemporary online platforms. Prior research on experimentation with interference has concentrated on the final output of a policy. The cumulative performance, while equally crucial, is less well understood. To address this gap, we introduce the problem of {em Multi-armed Bandits with Interference} (MABI), where the learner assigns an arm to each of $N$ experimental units over a time horizon of $T$ rounds. The reward of each unit in each round depends on the treatments of {em all} units, where the influence of a unit decays in the spatial distance between units. Furthermore, we employ a general setup wherein the reward functions are chosen by an adversary and may vary arbitrarily across rounds and units. We first show that switchback policies achieve an optimal {em expected} regret $tilde O(sqrt T)$ against the best fixed-arm policy. Nonetheless, the regret (as a random variable) for any switchback policy suffers a high variance, as it does not account for $N$. We propose a cluster randomization policy whose regret (i) is optimal in {em expectation} and (ii) admits a high probability bound that vanishes in $N$.

7/17/2024

🔍

A Best-of-both-worlds Algorithm for Bandits with Delayed Feedback with Robustness to Excessive Delays

Saeed Masoudian, Julian Zimmert, Yevgeny Seldin

We propose a new best-of-both-worlds algorithm for bandits with variably delayed feedback. In contrast to prior work, which required prior knowledge of the maximal delay $d_{mathrm{max}}$ and had a linear dependence of the regret on it, our algorithm can tolerate arbitrary excessive delays up to order $T$ (where $T$ is the time horizon). The algorithm is based on three technical innovations, which may all be of independent interest: (1) We introduce the first implicit exploration scheme that works in best-of-both-worlds setting. (2) We introduce the first control of distribution drift that does not rely on boundedness of delays. The control is based on the implicit exploration scheme and adaptive skipping of observations with excessive delays. (3) We introduce a procedure relating standard regret with drifted regret that does not rely on boundedness of delays. At the conceptual level, we demonstrate that complexity of best-of-both-worlds bandits with delayed feedback is characterized by the amount of information missing at the time of decision making (measured by the number of outstanding observations) rather than the time that the information is missing (measured by the delays).

5/29/2024