Non-Stationary Latent Auto-Regressive Bandits

Read original: arXiv:2402.03110 - Published 8/13/2024 by Anna L. Trella, Walter Dempsey, Finale Doshi-Velez, Susan A. Murphy

✨

Overview

The paper considers the stochastic multi-armed bandit problem, where a decision-maker must choose between multiple "arms" (options) to maximize their rewards over time.
It presents a novel formulation of non-stationarity in the environment, where changes in the mean reward of the arms over time are due to some unknown, latent, auto-regressive (AR) state of order k.
The authors propose an algorithm that can achieve near-optimal regret (the difference between the rewards obtained and the optimal rewards) in this "latent AR bandit" setting, even if the order k is mis-specified.

Plain English Explanation

The stochastic multi-armed bandit problem is a classic problem in machine learning and decision-making, where a decision-maker must repeatedly choose between multiple "arms" (options) in order to maximize their total rewards over time. The typical assumption is that the rewards for each arm are stationary, meaning they don't change over time.

However, in many real-world settings, the rewards may actually change over time in ways that are difficult to predict. This is often the case in emerging fields like behavioral health or education, where there are few mechanistic models of the environment.

The paper introduces a new formulation of this problem, which the authors call the "latent AR bandit." In this setting, the changes in the mean reward of the arms are due to some unknown, latent, auto-regressive (AR) state of order k. This means that the current reward of an arm depends on its past k rewards in a complex, non-linear way.

The authors propose an algorithm that can effectively navigate this non-stationary environment and achieve near-optimal regret, even if the order k is not known in advance. This is a significant advancement, as it allows decision-makers to make good choices in settings where the rewards are constantly changing in unpredictable ways.

Technical Explanation

The paper presents a novel formulation of the non-stationary multi-armed bandit problem, where the changes in the mean reward of the arms over time are driven by some unknown, latent, auto-regressive (AR) state of order k. This is in contrast to previous work, which has typically assumed the rewards are either stationary or follow a known, simple pattern of non-stationarity.

The authors propose an algorithm, called Latent-AR-UCB, that can effectively navigate this "latent AR bandit" environment. The key idea is to maintain an estimate of the latent AR state and use this to guide the arm selection process. Specifically, the algorithm uses an upper confidence bound (UCB) approach, where it chooses the arm with the highest estimated mean reward plus an exploration bonus.

Crucially, the authors show that their algorithm can achieve tilde{O}(ksqrt{T}) regret, where T is the total number of rounds and k is the order of the AR process. This means the algorithm performs nearly as well as the optimal policy, even if the order k is not known in advance.

The authors also provide empirical evaluations of their algorithm across multiple non-stationary environments, demonstrating that it outperforms standard UCB approaches, even when k is mis-specified.

Critical Analysis

The paper presents a novel and interesting formulation of the non-stationary multi-armed bandit problem, which captures the complex, auto-regressive nature of reward changes in many real-world settings. The proposed algorithm, Latent-AR-UCB, is both theoretically sound and empirically effective, suggesting it could be a valuable tool for decision-makers in a variety of applications.

One potential limitation is that the algorithm assumes the latent AR state follows a specific functional form, which may not always be the case in practice. It would be interesting to see how the algorithm performs under more general forms of non-stationarity, or if it could be extended to handle such cases.

Additionally, the paper does not discuss potential issues around the scalability of the algorithm, nor does it consider the computational complexity of maintaining the estimates of the latent AR state. These practical concerns would be important to address before deploying the algorithm in real-world, large-scale applications.

Overall, the paper makes a significant contribution to the field of non-stationary decision-making and provides a promising new approach for navigating complex, dynamic environments. Further research and evaluation could help to solidify the algorithm's performance and applicability in a wider range of settings.

Conclusion

The paper introduces a novel formulation of the non-stationary multi-armed bandit problem, called the "latent AR bandit," where changes in the mean reward of the arms are driven by an unknown, latent, auto-regressive state. The authors propose an effective algorithm, Latent-AR-UCB, that can achieve near-optimal regret in this setting, even when the order of the AR process is not known in advance.

This work represents an important advancement in the field of non-stationary decision-making, with potential applications in a variety of real-world domains where rewards are constantly changing in unpredictable ways. By providing a principled, yet flexible approach for navigating such environments, the paper lays the groundwork for more effective and adaptable decision-making algorithms that can help to improve outcomes in a wide range of contexts.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

✨

Non-Stationary Latent Auto-Regressive Bandits

Anna L. Trella, Walter Dempsey, Finale Doshi-Velez, Susan A. Murphy

We consider the stochastic multi-armed bandit problem with non-stationary rewards. We present a novel formulation of non-stationarity in the environment where changes in the mean reward of the arms over time are due to some unknown, latent, auto-regressive (AR) state of order $k$. We call this new environment the latent AR bandit. Different forms of the latent AR bandit appear in many real-world settings, especially in emerging scientific fields such as behavioral health or education where there are few mechanistic models of the environment. If the AR order $k$ is known, we propose an algorithm that achieves $tilde{O}(ksqrt{T})$ regret in this setting. Empirically, our algorithm outperforms standard UCB across multiple non-stationary environments, even if $k$ is mis-specified.

8/13/2024

🏷️

Adaptive Smooth Non-Stationary Bandits

Joe Suk

We study a $K$-armed non-stationary bandit model where rewards change smoothly, as captured by H{o}lder class assumptions on rewards as functions of time. Such smooth changes are parametrized by a H{o}lder exponent $beta$ and coefficient $lambda$. While various sub-cases of this general model have been studied in isolation, we first establish the minimax dynamic regret rate generally for all $K,beta,lambda$. Next, we show this optimal dynamic regret can be attained adaptively, without knowledge of $beta,lambda$. To contrast, even with parameter knowledge, upper bounds were only previously known for limited regimes $betaleq 1$ and $beta=2$ (Slivkins, 2014; Krishnamurthy and Gopalan, 2021; Manegueu et al., 2021; Jia et al.,2023). Thus, our work resolves open questions raised by these disparate threads of the literature. We also study the problem of attaining faster gap-dependent regret rates in non-stationary bandits. While such rates are long known to be impossible in general (Garivier and Moulines, 2011), we show that environments admitting a safe arm (Suk and Kpotufe, 2022) allow for much faster rates than the worst-case scaling with $sqrt{T}$. While previous works in this direction focused on attaining the usual logarithmic regret bounds, as summed over stationary periods, our new gap-dependent rates reveal new optimistic regimes of non-stationarity where even the logarithmic bounds are pessimistic. We show our new gap-dependent rate is tight and that its achievability (i.e., as made possible by a safe arm) has a surprisingly simple and clean characterization within the smooth H{o}lder class model.

7/12/2024

🏅

Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback

Guojun Xiong, Jian Li

Restless multi-armed bandits (RMAB) play a central role in modeling sequential decision making problems under an instantaneous activation constraint that at most B arms can be activated at any decision epoch. Each restless arm is endowed with a state that evolves independently according to a Markov decision process regardless of being activated or not. In this paper, we consider the task of learning in episodic RMAB with unknown transition functions and adversarial rewards, which can change arbitrarily across episodes. Further, we consider a challenging but natural bandit feedback setting that only adversarial rewards of activated arms are revealed to the decision maker (DM). The goal of the DM is to maximize its total adversarial rewards during the learning process while the instantaneous activation constraint must be satisfied in each decision epoch. We develop a novel reinforcement learning algorithm with two key contributors: a novel biased adversarial reward estimator to deal with bandit feedback and unknown transitions, and a low-complexity index policy to satisfy the instantaneous activation constraint. We show $tilde{mathcal{O}}(Hsqrt{T})$ regret bound for our algorithm, where $T$ is the number of episodes and $H$ is the episode length. To our best knowledge, this is the first algorithm to ensure $tilde{mathcal{O}}(sqrt{T})$ regret for adversarial RMAB in our considered challenging settings.

5/3/2024

Leveraging Offline Data in Linear Latent Bandits

Chinmaya Kausik, Kevin Tan, Ambuj Tewari

Sequential decision-making domains such as recommender systems, healthcare and education often have unobserved heterogeneity in the population that can be modeled using latent bandits $-$ a framework where an unobserved latent state determines the model for a trajectory. While the latent bandit framework is compelling, the extent of its generality is unclear. We first address this by establishing a de Finetti theorem for decision processes, and show that $textit{every}$ exchangeable and coherent stateless decision process is a latent bandit. The latent bandit framework lends itself particularly well to online learning with offline datasets, a problem of growing interest in sequential decision-making. One can leverage offline latent bandit data to learn a complex model for each latent state, so that an agent can simply learn the latent state online to act optimally. We focus on a linear model for a latent bandit with $d_A$-dimensional actions, where the latent states lie in an unknown $d_K$-dimensional subspace for $d_K ll d_A$. We present SOLD, a novel principled method to learn this subspace from short offline trajectories with guarantees. We then provide two methods to leverage this subspace online: LOCAL-UCB and ProBALL-UCB. We demonstrate that LOCAL-UCB enjoys $tilde O(min(d_Asqrt{T}, d_Ksqrt{T}(1+sqrt{d_AT/d_KN})))$ regret guarantees, where the effective dimension is lower when the size $N$ of the offline dataset is larger. ProBALL-UCB enjoys a slightly weaker guarantee, but is more practical and computationally efficient. Finally, we establish the efficacy of our methods using experiments on both synthetic data and real-life movie recommendation data from MovieLens.

5/28/2024