Blocking Bandits

Read original: arXiv:1907.11975 - Published 7/31/2024 by Soumya Basu, Rajat Sen, Sujay Sanghavi, Sanjay Shakkottai

🌀

Overview

The paper explores a novel stochastic multi-armed bandit setting where playing an arm makes it unavailable for a fixed number of time slots afterwards.
This models situations where reusing an arm too often is undesirable or infeasible, such as making the same product recommendation repeatedly or scheduling compute jobs on machines.
The authors analyze the complexity of the problem and propose algorithms to optimize cumulative reward in this setting.

Plain English Explanation

The paper looks at a new kind of multi-armed bandit problem. In a standard multi-armed bandit, you have a set of "arms" (options) you can pull, each with a different probability of giving you a reward. The goal is to maximize your total reward over time by figuring out which arms are most rewarding.

In this new setting, playing an arm makes it unavailable for a fixed number of time slots afterwards. This could model a real-world scenario like making product recommendations - you don't want to recommend the same product over and over to the same customer. Or it could model scheduling compute jobs on machines, where a machine is unavailable for a period after running a job.

The key challenge is that you have to balance exploiting the most rewarding arms while also exploring to find new rewarding arms that haven't been played recently. The paper analyzes the theoretical complexity of this problem and proposes algorithms to tackle it effectively.

Technical Explanation

The paper shows that with full knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward is computationally intractable. The authors prove this by mapping the problem to the PINWHEEL scheduling problem, which is known to be NP-hard.

However, the authors then show that a simple greedy algorithm, which plays the available arm with the highest reward, is asymptotically (1-1/e) optimal. This means it achieves at least 63% of the optimal reward, which is a strong guarantee.

When the rewards are unknown, the authors design a UCB (Upper Confidence Bound) based algorithm that achieves bounded regret compared to the greedy algorithm. Regret measures how much reward is lost compared to the optimal strategy. The algorithm leverages the "free exploration" provided by the unavailability of arms - when an arm is unavailable, the algorithm can explore other arms without sacrificing reward.

Finally, the authors consider the case where all the delays are equal. In this setting, the problem reduces to a Combinatorial Semi-bandit problem, for which they provide a lower bound on the achievable regret.

Critical Analysis

The paper tackles an interesting and practical extension of the multi-armed bandit problem. The theoretical analysis of the problem complexity and the design of near-optimal algorithms are both strong contributions.

One potential limitation is that the paper assumes the rewards and delays of the arms are known in advance. In many real-world scenarios, this information may not be available, and the algorithms would need to learn it from experience. The authors do address the unknown rewards case, but further research could investigate more realistic settings with partial information.

Additionally, the paper focuses on the objective of maximizing cumulative reward. In some applications, other objectives such as fairness or exploration may be more relevant. Extending the framework to handle multiple, potentially conflicting objectives could be an interesting direction for future work.

Overall, this paper presents a novel multi-armed bandit setting and provides valuable theoretical insights and algorithmic techniques that could inform the design of real-world systems facing similar constraints.

Conclusion

This paper introduces a new stochastic multi-armed bandit problem where playing an arm makes it unavailable for a fixed number of time slots. The authors analyze the computational complexity of the problem and propose efficient algorithms to optimize cumulative reward in this setting.

The key contributions are:

Showing the intractability of the problem with full information
Providing a simple greedy algorithm with strong asymptotic guarantees
Designing a UCB-based algorithm that achieves bounded regret under unknown rewards
Establishing a lower bound for the special case of equal delays

These results offer valuable insights for developing multi-armed bandit systems that need to account for constraints on arm usage, with applications in recommendation systems, job scheduling, and beyond.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🌀

Blocking Bandits

Soumya Basu, Rajat Sen, Sujay Sanghavi, Sanjay Shakkottai

We consider a novel stochastic multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter. This models situations where reusing an arm too often is undesirable (e.g. making the same product recommendation repeatedly) or infeasible (e.g. compute job scheduling on machines). We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm (in the number of arms) unless randomized exponential time hypothesis is false, by mapping to the PINWHEEL scheduling problem. Subsequently, we show that a simple greedy algorithm that plays the available arm with the highest reward is asymptotically $(1-1/e)$ optimal. When the rewards are unknown, we design a UCB based algorithm which is shown to have $c log T + o(log T)$ cumulative regret against the greedy algorithm, leveraging the free exploration of arms due to the unavailability. Finally, when all the delays are equal the problem reduces to Combinatorial Semi-bandits providing us with a lower bound of $c' log T+ omega(log T)$.

7/31/2024

🧠

When is exponential asymptotic optimality achievable in average-reward restless bandits?

Yige Hong, Qiaomin Xie, Yudong Chen, Weina Wang

We consider the discrete-time infinite-horizon average-reward restless bandit problem. We propose a novel policy that maintains two dynamic subsets of arms: one subset of arms has a nearly optimal state distribution and takes actions according to an Optimal Local Control routine; the other subset of arms is driven towards the optimal state distribution and gradually merged into the first subset. We show that our policy is asymptotically optimal with an $O(exp(-C N))$ optimality gap for an $N$-armed problem, under the mild assumptions of aperiodic-unichain, non-degeneracy, and local stability. Our policy is the first to achieve exponential asymptotic optimality under the above set of easy-to-verify assumptions, whereas prior work either requires a strong Global Attractor assumption or only achieves an $O(1/sqrt{N})$ optimality gap. We further discuss the fundamental obstacles in significantly weakening our assumptions. In particular, we prove a lower bound showing that local stability is fundamental for exponential asymptotic optimality.

5/29/2024

📊

Rotting Infinitely Many-armed Bandits beyond the Worst-case Rotting: An Adaptive Approach

Jung-hun Kim, Milan Vojnovic, Se-Young Yun

In this study, we consider the infinitely many-armed bandit problems in a rested rotting setting, where the mean reward of an arm may decrease with each pull, while otherwise, it remains unchanged. We explore two scenarios regarding the rotting of rewards: one in which the cumulative amount of rotting is bounded by $V_T$, referred to as the slow-rotting case, and the other in which the cumulative number of rotting instances is bounded by $S_T$, referred to as the abrupt-rotting case. To address the challenge posed by rotting rewards, we introduce an algorithm that utilizes UCB with an adaptive sliding window, designed to manage the bias and variance trade-off arising due to rotting rewards. Our proposed algorithm achieves tight regret bounds for both slow and abrupt rotting scenarios. Lastly, we demonstrate the performance of our algorithm using numerical experiments.

5/27/2024

Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays

Ziqun Chen, Kechao Cai, Zhuoyue Chen, Jinbei Zhang, John C. S. Lui

We study the stochastic combinatorial semi-bandit problem with unrestricted feedback delays under merit-based fairness constraints. This is motivated by applications such as crowdsourcing, and online advertising, where immediate feedback is not immediately available and fairness among different choices (or arms) is crucial. We consider two types of unrestricted feedback delays: reward-independent delays where the feedback delays are independent of the rewards, and reward-dependent delays where the feedback delays are correlated with the rewards. Furthermore, we introduce merit-based fairness constraints to ensure a fair selection of the arms. We define the reward regret and the fairness regret and present new bandit algorithms to select arms under unrestricted feedback delays based on their merits. We prove that our algorithms all achieve sublinear expected reward regret and expected fairness regret, with a dependence on the quantiles of the delay distribution. We also conduct extensive experiments using synthetic and real-world data and show that our algorithms can fairly select arms with different feedback delays.

7/30/2024