Best Arm Identification for Stochastic Rising Bandits

Read original: arXiv:2302.07510 - Published 5/29/2024 by Marco Mussi, Alessandro Montenegro, Francesco Trov'o, Marcello Restelli, Alberto Maria Metelli

🛸

Overview

The paper introduces "Stochastic Rising Bandits" (SRBs), which model sequential decision-making problems where the expected reward of available options increases over time as they are selected.
This setting captures scenarios like online best model selection, where options (e.g., machine learning models) improve in performance as they are used.
The paper focuses on the "Best Arm Identification" (BAI) problem for SRBs, where the goal is to identify the best option given a fixed budget of rounds.
Two algorithms are proposed: R-UCBE, which uses a UCB-like approach, and R-SR, which employs a successive reject procedure.
The paper provides theoretical guarantees on the probability of correctly identifying the optimal option and the simple regret, and also derives a lower bound on the error probability.
The proposed algorithms are validated on both synthetic and realistic environments.

Plain English Explanation

The paper examines a type of decision-making problem called "Stochastic Rising Bandits" (SRBs). In these problems, the available options (like different machine learning models) get better over time as they are used more. For example, an online system might be choosing between several models to deploy, and the models improve in performance as they are selected and tested.

The specific problem the paper focuses on is called "Best Arm Identification" (BAI). In this setting, you have a fixed budget of rounds (e.g., a limited number of times you can test the models), and at the end, you need to recommend the single best option.

The paper proposes two algorithms to solve this BAI problem for SRBs. The first, called R-UCBE, uses an approach similar to the popular "Upper Confidence Bound" (UCB) algorithm. The second, called R-SR, employs a "successive reject" procedure.

The key contributions of the paper are:

Proving that with a large enough budget, the proposed algorithms can accurately identify the truly optimal option with high probability.
Deriving a lower bound on the error probability, showing that a large budget is fundamentally necessary in the SRB setting.
Validating the algorithms on both synthetic data and real-world problems, demonstrating their practical effectiveness.

The main significance is that the paper provides principled, provably-correct techniques for identifying the best option in settings where the available choices are continuously improving, which is an important real-world scenario.

Technical Explanation

The paper introduces the "Stochastic Rising Bandits" (SRB) model, which captures sequential decision-making problems where the expected reward of the available options increases each time they are selected. This setting is motivated by scenarios like online best model selection, where machine learning models improve in performance as they are used more.

While previous works have focused on regret minimization in SRBs, this paper tackles the "Best Arm Identification" (BAI) problem. In the BAI setting, the goal is to identify the single best option at the end of a fixed budget of rounds, rather than minimizing cumulative regret.

The authors propose two algorithms to solve the BAI problem for SRBs:

R-UCBE: This algorithm uses a UCB-like approach, maintaining upper confidence bounds on the expected rewards of each option and selecting the option with the highest bound.
R-SR: This algorithm employs a successive reject procedure, iteratively eliminating the worst-performing option until only the estimated best option remains.

The paper provides theoretical guarantees for these algorithms. Specifically, it shows that with a sufficiently large budget, both R-UCBE and R-SR can achieve high probability of correctly identifying the optimal option, and they also provide bounds on the simple regret (the difference between the expected reward of the recommended option and the true optimal option).

Furthermore, the authors derive a fundamental lower bound on the error probability for the BAI problem in the SRB setting. This lower bound demonstrates that a large enough budget is necessary for accurate identification, as the information gain about the options' rewards becomes more challenging over time.

The proposed algorithms are evaluated on both synthetic data and real-world problems, such as graph-feedback bandits and imprecise multi-armed bandits, where they are shown to outperform baseline approaches.

Critical Analysis

The paper makes several important contributions to the literature on sequential decision-making and best arm identification. The introduction of the SRB model captures a realistic and relevant scenario, and the proposed algorithms provide principled and provably-correct solutions for the BAI problem in this setting.

One potential limitation is that the theoretical guarantees rely on the assumption of a sufficiently large budget. While the authors demonstrate the necessity of this assumption through the derived lower bound, the practical implications of this requirement are not fully explored. It would be interesting to understand how the budget size affects the empirical performance of the algorithms, and whether there are ways to mitigate the impact of a limited budget.

Additionally, the paper focuses on the single-player, non-adversarial setting. It would be valuable to investigate extensions of the SRB model and the proposed algorithms to more complex scenarios, such as adversarial restless multi-armed bandits or Bayes-optimal algorithms for frequentist best arm identification.

Overall, the paper presents a novel and insightful take on the best arm identification problem, with robust theoretical foundations and promising practical results. The work opens up several avenues for further research and exploration in the field of sequential decision-making under uncertainty.

Conclusion

The paper introduces the "Stochastic Rising Bandits" (SRB) model, which captures sequential decision-making problems where the available options improve in expected reward as they are selected. The authors focus on the Best Arm Identification (BAI) problem in this setting, where the goal is to identify the single best option given a fixed budget of rounds.

To address this problem, the paper proposes two algorithms: R-UCBE, which uses a UCB-like approach, and R-SR, which employs a successive reject procedure. The authors prove that these algorithms can accurately identify the optimal option with high probability, given a sufficiently large budget, and also derive a lower bound on the error probability.

The practical significance of this work lies in its ability to provide principled solutions for identifying the best option in scenarios where the available choices are continuously improving, such as online model selection or graph-feedback bandits. The empirical validation of the proposed algorithms on both synthetic and realistic environments further demonstrates their effectiveness.

Overall, this paper contributes valuable theoretical insights and practical techniques for sequential decision-making problems, and it opens up avenues for future research to extend the SRB model and explore more complex settings.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🛸

Best Arm Identification for Stochastic Rising Bandits

Marco Mussi, Alessandro Montenegro, Francesco Trov'o, Marcello Restelli, Alberto Maria Metelli

Stochastic Rising Bandits (SRBs) model sequential decision-making problems in which the expected reward of the available options increases every time they are selected. This setting captures a wide range of scenarios in which the available options are learning entities whose performance improves (in expectation) over time (e.g., online best model selection). While previous works addressed the regret minimization problem, this paper focuses on the fixed-budget Best Arm Identification (BAI) problem for SRBs. In this scenario, given a fixed budget of rounds, we are asked to provide a recommendation about the best option at the end of the identification process. We propose two algorithms to tackle the above-mentioned setting, namely R-UCBE, which resorts to a UCB-like approach, and R-SR, which employs a successive reject procedure. Then, we prove that, with a sufficiently large budget, they provide guarantees on the probability of properly identifying the optimal option at the end of the learning process and on the simple regret. Furthermore, we derive a lower bound on the error probability, matched by our R-SR (up to constants), and illustrate how the need for a sufficiently large budget is unavoidable in the SRB setting. Finally, we numerically validate the proposed algorithms in both synthetic and realistic environments.

5/29/2024

🔎

New!Best Arm Identification with Minimal Regret

Junwen Yang, Vincent Y. F. Tan, Tianyuan Jin

Motivated by real-world applications that necessitate responsible experimentation, we introduce the problem of best arm identification (BAI) with minimal regret. This innovative variant of the multi-armed bandit problem elegantly amalgamates two of its most ubiquitous objectives: regret minimization and BAI. More precisely, the agent's goal is to identify the best arm with a prescribed confidence level $delta$, while minimizing the cumulative regret up to the stopping time. Focusing on single-parameter exponential families of distributions, we leverage information-theoretic techniques to establish an instance-dependent lower bound on the expected cumulative regret. Moreover, we present an intriguing impossibility result that underscores the tension between cumulative regret and sample complexity in fixed-confidence BAI. Complementarily, we design and analyze the Double KL-UCB algorithm, which achieves asymptotic optimality as the confidence level tends to zero. Notably, this algorithm employs two distinct confidence bounds to guide arm selection in a randomized manner. Our findings elucidate a fresh perspective on the inherent connections between regret minimization and BAI.

9/30/2024

UCB Exploration for Fixed-Budget Bayesian Best Arm Identification

Rong J. B. Zhu, Yanqi Qiu

We study best-arm identification (BAI) in the fixed-budget setting. Adaptive allocations based on upper confidence bounds (UCBs), such as UCBE, are known to work well in BAI. However, it is well-known that its optimal regret is theoretically dependent on instances, which we show to be an artifact in many fixed-budget BAI problems. In this paper we propose an UCB exploration algorithm that is both theoretically and empirically efficient for the fixed budget BAI problem under a Bayesian setting. The key idea is to learn prior information, which can enhance the performance of UCB-based BAI algorithm as it has done in the cumulative regret minimization problem. We establish bounds on the failure probability and the simple regret for the Bayesian BAI problem, providing upper bounds of order $tilde{O}(sqrt{K/n})$, up to logarithmic factors, where $n$ represents the budget and $K$ denotes the number of arms. Furthermore, we demonstrate through empirical results that our approach consistently outperforms state-of-the-art baselines.

8/12/2024

Optimizing Sharpe Ratio: Risk-Adjusted Decision-Making in Multi-Armed Bandits

Sabrina Khurshid, Mohammed Shahid Abdulla, Gourab Ghatak

Sharpe Ratio (SR) is a critical parameter in characterizing financial time series as it jointly considers the reward and the volatility of any stock/portfolio through its variance. Deriving online algorithms for optimizing the SR is particularly challenging since even offline policies experience constant regret with respect to the best expert Even-Dar et al (2006). Thus, instead of optimizing the usual definition of SR, we optimize regularized square SR (RSSR). We consider two settings for the RSSR, Regret Minimization (RM) and Best Arm Identification (BAI). In this regard, we propose a novel multi-armed bandit (MAB) algorithm for RM called UCB-RSSR for RSSR maximization. We derive a path-dependent concentration bound for the estimate of the RSSR. Based on that, we derive the regret guarantees of UCB-RSSR and show that it evolves as O(log n) for the two-armed bandit case played for a horizon n. We also consider a fixed budget setting for well-known BAI algorithms, i.e., sequential halving and successive rejects, and propose SHVV, SHSR, and SuRSR algorithms. We derive the upper bound for the error probability of all proposed BAI algorithms. We demonstrate that UCB-RSSR outperforms the only other known SR optimizing bandit algorithm, U-UCB Cassel et al (2023). We also establish its efficacy with respect to other benchmarks derived from the GRA-UCB and MVTS algorithms. We further demonstrate the performance of proposed BAI algorithms for multiple different setups. Our research highlights that our proposed algorithms will find extensive applications in risk-aware portfolio management problems. Consequently, our research highlights that our proposed algorithms will find extensive applications in risk-aware portfolio management problems.

6/12/2024