Rotting Infinitely Many-armed Bandits beyond the Worst-case Rotting: An Adaptive Approach

Read original: arXiv:2404.14202 - Published 5/27/2024 by Jung-hun Kim, Milan Vojnovic, Se-Young Yun

📊

Overview

This study explores the challenge of "infinitely many armed bandit problems" in "rotting environments" where the average reward of an arm may decrease over time.
The researchers propose an algorithm that uses UCB (Upper Confidence Bound) with an adaptive sliding window to manage the bias and variance trade-off caused by decreasing rewards.
The algorithm is evaluated on both "slow-rotting" and "abrupt-rotting" scenarios, and achieves tight regret bounds for both cases.

Plain English Explanation

The study looks at a problem called the "infinitely many armed bandit problem" in environments where the potential rewards from different options (called "arms") can decrease over time. This is like a casino game where the payouts from certain slot machines slowly get worse over repeated plays.

To tackle this challenge, the researchers developed a new algorithm that uses a technique called "Upper Confidence Bound" (UCB) along with an "adaptive sliding window." This helps the algorithm balance the need to explore new options (to find the best one) with exploiting the currently known best option, while also accounting for the decreasing rewards.

The algorithm was tested in two different scenarios: one where the overall decrease in rewards is gradual ("slow-rotting"), and another where the decrease happens more abruptly ("abrupt-rotting"). In both cases, the new algorithm was able to achieve strong performance, meaning it could maximize the total rewards earned over time.

Technical Explanation

The study considers infinitely many armed bandit problems in "rotting environments," where the mean reward of an arm may decrease with each pull, while otherwise remaining unchanged. Two scenarios are explored to capture different problem-dependent characteristics regarding the decay of rewards:

The "slow-rotting" scenario, where the cumulative amount of rotting is bounded by a value $V_T$.
The "abrupt-rotting" scenario, where the number of rotting instances is bounded by a value $S_T$.

To address the challenge posed by decreasing rewards, the researchers introduce an algorithm that utilizes UCB with an adaptive sliding window. This is designed to manage the bias and variance trade-off arising due to the rotting rewards. The proposed algorithm achieves tight regret bounds for both the slow and abrupt rotting scenarios.

The performance of the algorithms is demonstrated using synthetic datasets.

Critical Analysis

The paper provides a thorough analysis of the proposed algorithm and its performance in the slow-rotting and abrupt-rotting scenarios. However, the authors acknowledge that the study is limited to these specific scenarios and do not consider more complex settings, such as contextual bandits or restless multi-armed bandits.

Additionally, the paper does not provide any real-world applications or empirical evaluations that would help demonstrate the practical relevance and impact of the research. Exploring such applications, such as caching of dynamic content, could be a valuable direction for future work.

Conclusion

This study addresses the challenge of infinitely many armed bandit problems in rotting environments, where the mean reward of an arm may decrease over time. The researchers introduce a novel algorithm that uses UCB with an adaptive sliding window to manage the bias and variance trade-off caused by decreasing rewards. The algorithm is shown to achieve tight regret bounds in both slow-rotting and abrupt-rotting scenarios, suggesting its potential for practical applications in domains where rewards may degrade over time.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

📊

Rotting Infinitely Many-armed Bandits beyond the Worst-case Rotting: An Adaptive Approach

Jung-hun Kim, Milan Vojnovic, Se-Young Yun

In this study, we consider the infinitely many-armed bandit problems in a rested rotting setting, where the mean reward of an arm may decrease with each pull, while otherwise, it remains unchanged. We explore two scenarios regarding the rotting of rewards: one in which the cumulative amount of rotting is bounded by $V_T$, referred to as the slow-rotting case, and the other in which the cumulative number of rotting instances is bounded by $S_T$, referred to as the abrupt-rotting case. To address the challenge posed by rotting rewards, we introduce an algorithm that utilizes UCB with an adaptive sliding window, designed to manage the bias and variance trade-off arising due to rotting rewards. Our proposed algorithm achieves tight regret bounds for both slow and abrupt rotting scenarios. Lastly, we demonstrate the performance of our algorithm using numerical experiments.

5/27/2024

🌀

Blocking Bandits

Soumya Basu, Rajat Sen, Sujay Sanghavi, Sanjay Shakkottai

We consider a novel stochastic multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter. This models situations where reusing an arm too often is undesirable (e.g. making the same product recommendation repeatedly) or infeasible (e.g. compute job scheduling on machines). We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm (in the number of arms) unless randomized exponential time hypothesis is false, by mapping to the PINWHEEL scheduling problem. Subsequently, we show that a simple greedy algorithm that plays the available arm with the highest reward is asymptotically $(1-1/e)$ optimal. When the rewards are unknown, we design a UCB based algorithm which is shown to have $c log T + o(log T)$ cumulative regret against the greedy algorithm, leveraging the free exploration of arms due to the unavailability. Finally, when all the delays are equal the problem reduces to Combinatorial Semi-bandits providing us with a lower bound of $c' log T+ omega(log T)$.

7/31/2024

🌿

Imprecise Multi-Armed Bandits

Vanessa Kosoy

We introduce a novel multi-armed bandit framework, where each arm is associated with a fixed unknown credal set over the space of outcomes (which can be richer than just the reward). The arm-to-credal-set correspondence comes from a known class of hypotheses. We then define a notion of regret corresponding to the lower prevision defined by these credal sets. Equivalently, the setting can be regarded as a two-player zero-sum game, where, on each round, the agent chooses an arm and the adversary chooses the distribution over outcomes from a set of options associated with this arm. The regret is defined with respect to the value of game. For certain natural hypothesis classes, loosely analgous to stochastic linear bandits (which are a special case of the resulting setting), we propose an algorithm and prove a corresponding upper bound on regret. We also prove lower bounds on regret for particular special cases.

5/10/2024

Global Rewards in Restless Multi-Armed Bandits

Naveen Raman, Zheyuan Ryan Shi, Fei Fang

Restless multi-armed bandits (RMAB) extend multi-armed bandits so pulling an arm impacts future states. Despite the success of RMABs, a key limiting assumption is the separability of rewards into a sum across arms. We address this deficiency by proposing restless-multi-armed bandit with global rewards (RMAB-G), a generalization of RMABs to global non-separable rewards. To solve RMAB-G, we develop the Linear- and Shapley-Whittle indices, which extend Whittle indices from RMABs to RMAB-Gs. We prove approximation bounds but also point out how these indices could fail when reward functions are highly non-linear. To overcome this, we propose two sets of adaptive policies: the first computes indices iteratively, and the second combines indices with Monte-Carlo Tree Search (MCTS). Empirically, we demonstrate that our proposed policies outperform baselines and index-based policies with synthetic data and real-world data from food rescue.

6/11/2024