A Best-of-both-worlds Algorithm for Bandits with Delayed Feedback with Robustness to Excessive Delays

Read original: arXiv:2308.10675 - Published 5/29/2024 by Saeed Masoudian, Julian Zimmert, Yevgeny Seldin

🔍

Overview

Proposes a new algorithm for bandits with variable delayed feedback
Improves on prior work by not requiring prior knowledge of maximum delay and having better regret bounds
Introduces three key technical innovations related to exploration, drift control, and relating standard and drifted regret

Plain English Explanation

This research paper introduces a new algorithm for a type of multi-armed bandit problem where the feedback (reward information) is delayed by an unknown and variable amount of time.

Prior algorithms required knowing the maximum possible delay ahead of time, and their performance depended linearly on this maximum delay. In contrast, the new algorithm can handle arbitrary excessive delays up to the total time horizon, without needing to know the maximum delay.

The key innovations are:

A new exploration scheme that works well in this "best-of-both-worlds" setting with delayed feedback.
A way to control the drift in the distribution of rewards over time, without relying on bounded delays.
A procedure to relate standard regret (a measure of performance) to "drifted regret", again without needing bounded delays.

The core insight is that the complexity of this bandit problem is determined by the amount of missing information at the time of decision-making, rather than the absolute time that the information is missing. This allows the new algorithm to achieve better performance than previous approaches.

Technical Explanation

The paper proposes a new algorithm for bandits with variably delayed feedback. In contrast to prior work, which required prior knowledge of the maximum delay $d_{max}$ and had a linear dependence of the regret on it, the new algorithm can tolerate arbitrary excessive delays up to the time horizon $T$.

This improvement is achieved through three key technical innovations:

The authors introduce the first implicit exploration scheme that works in the best-of-both-worlds setting with delayed feedback.
They introduce the first way to control the drift in the distribution of rewards over time, without relying on bounded delays.
They introduce a procedure to relate standard regret to "drifted regret" without needing bounded delays.

At a conceptual level, the paper demonstrates that the complexity of best-of-both-worlds bandits with delayed feedback is characterized by the amount of missing information at decision time, rather than the absolute time that the information is missing.

Critical Analysis

The paper introduces several novel technical ideas to address the challenge of bandits with variable delayed feedback. The authors thoroughly analyze the performance of their algorithm and provide rigorous theoretical guarantees.

One potential limitation is that the analysis assumes the delays are independent of the chosen actions. In practice, there may be correlations between the actions and the delay distributions that are not captured by the model.

Additionally, the paper does not consider the computational complexity of implementing the algorithm, which could be an important practical concern. Further research may be needed to develop efficient implementations.

Overall, this work makes a significant contribution by advancing the state-of-the-art in bandit problems with delayed feedback. The technical innovations and conceptual insights are likely to be of interest to researchers working on related problems in multi-armed bandits, reinforcement learning, and online decision-making under uncertainty.

Conclusion

This paper proposes a new algorithm for bandits with variable delayed feedback that outperforms previous approaches. The key innovations include a novel exploration scheme, a way to control distribution drift without bounded delays, and a procedure to relate standard and drifted regret.

The core insight is that the complexity of these bandit problems is determined by the amount of missing information at decision time, rather than the absolute time that the information is missing. This allows the new algorithm to achieve better regret bounds than prior work.

The technical contributions and conceptual advances made in this paper are likely to have a broader impact on the field of online decision-making under uncertainty, with potential applications in areas like reinforcement learning, contextual bandits, and recommendation systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🔍

A Best-of-both-worlds Algorithm for Bandits with Delayed Feedback with Robustness to Excessive Delays

Saeed Masoudian, Julian Zimmert, Yevgeny Seldin

We propose a new best-of-both-worlds algorithm for bandits with variably delayed feedback. In contrast to prior work, which required prior knowledge of the maximal delay $d_{mathrm{max}}$ and had a linear dependence of the regret on it, our algorithm can tolerate arbitrary excessive delays up to order $T$ (where $T$ is the time horizon). The algorithm is based on three technical innovations, which may all be of independent interest: (1) We introduce the first implicit exploration scheme that works in best-of-both-worlds setting. (2) We introduce the first control of distribution drift that does not rely on boundedness of delays. The control is based on the implicit exploration scheme and adaptive skipping of observations with excessive delays. (3) We introduce a procedure relating standard regret with drifted regret that does not rely on boundedness of delays. At the conceptual level, we demonstrate that complexity of best-of-both-worlds bandits with delayed feedback is characterized by the amount of information missing at the time of decision making (measured by the number of outstanding observations) rather than the time that the information is missing (measured by the delays).

5/29/2024

Biased Dueling Bandits with Stochastic Delayed Feedback

Bongsoo Yi, Yue Kang, Yao Li

The dueling bandit problem, an essential variation of the traditional multi-armed bandit problem, has become significantly prominent recently due to its broad applications in online advertising, recommendation systems, information retrieval, and more. However, in many real-world applications, the feedback for actions is often subject to unavoidable delays and is not immediately available to the agent. This partially observable issue poses a significant challenge to existing dueling bandit literature, as it significantly affects how quickly and accurately the agent can update their policy on the fly. In this paper, we introduce and examine the biased dueling bandit problem with stochastic delayed feedback, revealing that this new practical problem will delve into a more realistic and intriguing scenario involving a preference bias between the selections. We present two algorithms designed to handle situations involving delay. Our first algorithm, requiring complete delay distribution information, achieves the optimal regret bound for the dueling bandit problem when there is no delay. The second algorithm is tailored for situations where the distribution is unknown, but only the expected value of delay is available. We provide a comprehensive regret analysis for the two proposed algorithms and then evaluate their empirical performance on both synthetic and real datasets.

8/28/2024

🏷️

Beyond Primal-Dual Methods in Bandits with Stochastic and Adversarial Constraints

Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Federico Fusco

We address a generalization of the bandit with knapsacks problem, where a learner aims to maximize rewards while satisfying an arbitrary set of long-term constraints. Our goal is to design best-of-both-worlds algorithms that perform optimally under both stochastic and adversarial constraints. Previous works address this problem via primal-dual methods, and require some stringent assumptions, namely the Slater's condition, and in adversarial settings, they either assume knowledge of a lower bound on the Slater's parameter, or impose strong requirements on the primal and dual regret minimizers such as requiring weak adaptivity. We propose an alternative and more natural approach based on optimistic estimations of the constraints. Surprisingly, we show that estimating the constraints with an UCB-like approach guarantees optimal performances. Our algorithm consists of two main components: (i) a regret minimizer working on emph{moving strategy sets} and (ii) an estimate of the feasible set as an optimistic weighted empirical mean of previous samples. The key challenge in this approach is designing adaptive weights that meet the different requirements for stochastic and adversarial constraints. Our algorithm is significantly simpler than previous approaches, and has a cleaner analysis. Moreover, ours is the first best-of-both-worlds algorithm providing bounds logarithmic in the number of constraints. Additionally, in stochastic settings, it provides $widetilde O(sqrt{T})$ regret emph{without} Slater's condition.

5/28/2024

✅

Delay as Payoff in MAB

Ofir Schlisselberg, Ido Cohen, Tal Lancewicki, Yishay Mansour

In this paper, we investigate a variant of the classical stochastic Multi-armed Bandit (MAB) problem, where the payoff received by an agent (either cost or reward) is both delayed, and directly corresponds to the magnitude of the delay. This setting models faithfully many real world scenarios such as the time it takes for a data packet to traverse a network given a choice of route (where delay serves as the agent's cost); or a user's time spent on a web page given a choice of content (where delay serves as the agent's reward). Our main contributions are tight upper and lower bounds for both the cost and reward settings. For the case that delays serve as costs, which we are the first to consider, we prove optimal regret that scales as $sum_{i:Delta_i > 0}frac{log T}{Delta_i} + d^*$, where $T$ is the maximal number of steps, $Delta_i$ are the sub-optimality gaps and $d^*$ is the minimal expected delay amongst arms. For the case that delays serves as rewards, we show optimal regret of $sum_{i:Delta_i > 0}frac{log T}{Delta_i} + bar{d}$, where $bar d$ is the second maximal expected delay. These improve over the regret in the general delay-dependent payoff setting, which scales as $sum_{i:Delta_i > 0}frac{log T}{Delta_i} + D$, where $D$ is the maximum possible delay. Our regret bounds highlight the difference between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than for the reward. Finally, we accompany our theoretical results with an empirical evaluation.

8/28/2024