Low-rank Matrix Bandits with Heavy-tailed Rewards

Read original: arXiv:2404.17709 - Published 4/30/2024 by Yue Kang, Cho-Jui Hsieh, Thomas C. M. Lee

🗣️

Overview

This paper explores a new problem called "low-rank matrix bandit with heavy-tailed rewards" (LowHTR), which relaxes the strict assumption of sub-Gaussian noise in prior studies.
The authors propose a novel algorithm called LOTUS that can achieve the same regret bound as the state-of-the-art algorithms under sub-Gaussian noise, without knowing the time horizon T.
They also establish a lower bound for the LowHTR problem, showing that LOTUS is nearly optimal in the order of T.
Additionally, the authors improve LOTUS to not require knowledge of the rank r, and show its efficiency in high-dimensional scenarios.

Plain English Explanation

In the field of multi-armed bandits, researchers often study how an agent can maximize their rewards by selecting the best "arm" (or action) from a set of options, without knowing the underlying reward distributions. One specific problem is the "low-rank matrix bandit," where the expected reward of each arm is related to an unknown low-rank parameter matrix.

Previous studies on this problem have assumed that the rewards are mixed with sub-Gaussian noise, meaning the tails of the reward distributions decay exponentially. However, in real-world scenarios, the reward distributions may have "heavier" tails, where the tails decay more slowly.

In this paper, the authors introduce the "low-rank matrix bandit with heavy-tailed rewards" (LowHTR) problem, which relaxes the sub-Gaussian assumption. They propose a new algorithm called LOTUS that can achieve the same regret (i.e., the difference between the optimal reward and the agent's cumulative reward) as state-of-the-art algorithms under sub-Gaussian noise, without requiring knowledge of the time horizon.

The authors also prove a lower bound for the LowHTR problem, showing that LOTUS is nearly optimal in terms of the time dependence. Additionally, they improve LOTUS to not require knowledge of the rank of the parameter matrix, and demonstrate its efficiency in high-dimensional scenarios.

Technical Explanation

In the low-rank matrix bandit problem, the expected reward of each arm (i.e., action) is equal to the inner product between its feature matrix and an unknown low-rank parameter matrix Θ^*. Prior studies have assumed that the rewards are mixed with sub-Gaussian noise, meaning the tails of the reward distributions decay exponentially.

In this work, the authors introduce the "low-rank matrix bandit with heavy-tailed rewards" (LowHTR) problem, where the rewards only have finite (1+δ) moments for some δ in (0,1]. This relaxes the strict sub-Gaussian assumption and allows for "heavier" tailed reward distributions.

To address the LowHTR problem, the authors propose a novel algorithm called LOTUS. LOTUS utilizes truncation on observed payoffs and dynamic exploration to achieve a regret bound of order

O~(d^(3/2)r^(1/2)T^(1/(1+δ))/D~

^(r,r))_, where d is the feature dimension, r is the rank of Θ^*, T is the time horizon, and D~(r,r) is a parameter related to the low-rank structure. Importantly, this regret bound matches the state-of-the-art results under sub-Gaussian noise (i.e., δ=1), without requiring knowledge of T.

Furthermore, the authors establish a lower bound of order

Ω(d^(δ/(1+δ)) r^(δ/(1+δ)) T^(1/(1+δ)))

for the LowHTR problem, indicating that LOTUS is nearly optimal in the order of T.

The authors also improve LOTUS to not require knowledge of the rank r, achieving a regret bound of

O~(dr^(3/2)T^((1+δ)/(1+2δ)))

, and demonstrate its efficiency in high-dimensional scenarios through simulations.

Critical Analysis

The authors have made a significant contribution by relaxing the sub-Gaussian assumption in the low-rank matrix bandit problem and introducing the more general LowHTR problem. This is an important step towards developing bandit algorithms that can handle a wider range of real-world scenarios, where the reward distributions may have heavier tails.

The LOTUS algorithm proposed in the paper is a novel approach that can achieve the state-of-the-art regret bounds without requiring knowledge of the time horizon T. This is a particularly useful feature, as the time horizon may not always be known in practical applications.

One potential limitation of the research is that the authors have not explored the performance of LOTUS in more complex or realistic scenarios, such as those involving adversarial bandits or local privacy constraints. It would be interesting to see how the algorithm handles these additional challenges.

Additionally, while the authors have established a lower bound for the LowHTR problem, it would be helpful to understand the tightness of this bound and whether there is room for further improvements in the regret bounds.

Overall, this paper represents a valuable contribution to the field of multi-armed bandits, particularly in the context of low-rank matrix bandits with heavy-tailed rewards. The LOTUS algorithm and the theoretical insights provided in the paper pave the way for future research in this area.

Conclusion

This paper introduces a new problem called "low-rank matrix bandit with heavy-tailed rewards" (LowHTR), which relaxes the strict sub-Gaussian assumption in prior studies on low-rank matrix bandits. The authors propose a novel algorithm called LOTUS that can achieve the same regret bound as state-of-the-art algorithms under sub-Gaussian noise, without requiring knowledge of the time horizon.

The authors also establish a lower bound for the LowHTR problem, showing that LOTUS is nearly optimal in the order of the time horizon. Furthermore, they improve LOTUS to not require knowledge of the rank of the parameter matrix, and demonstrate its efficiency in high-dimensional scenarios.

This research represents an important step towards developing more robust and practical bandit algorithms that can handle a wider range of real-world scenarios, where the reward distributions may have heavier tails. The insights and techniques presented in this paper can inspire further advancements in the field of multi-armed bandits and their applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🗣️

Low-rank Matrix Bandits with Heavy-tailed Rewards

Yue Kang, Cho-Jui Hsieh, Thomas C. M. Lee

In stochastic low-rank matrix bandit, the expected reward of an arm is equal to the inner product between its feature matrix and some unknown $d_1$ by $d_2$ low-rank parameter matrix $Theta^*$ with rank $r ll d_1wedge d_2$. While all prior studies assume the payoffs are mixed with sub-Gaussian noises, in this work we loosen this strict assumption and consider the new problem of underline{low}-rank matrix bandit with underline{h}eavy-underline{t}ailed underline{r}ewards (LowHTR), where the rewards only have finite $(1+delta)$ moment for some $delta in (0,1]$. By utilizing the truncation on observed payoffs and the dynamic exploration, we propose a novel algorithm called LOTUS attaining the regret bound of order $tilde O(d^frac{3}{2}r^frac{1}{2}T^frac{1}{1+delta}/tilde{D}_{rr})$ without knowing $T$, which matches the state-of-the-art regret bound under sub-Gaussian noises~citep{lu2021low,kang2022efficient} with $delta = 1$. Moreover, we establish a lower bound of the order $Omega(d^frac{delta}{1+delta} r^frac{delta}{1+delta} T^frac{1}{1+delta}) = Omega(T^frac{1}{1+delta})$ for LowHTR, which indicates our LOTUS is nearly optimal in the order of $T$. In addition, we improve LOTUS so that it does not require knowledge of the rank $r$ with $tilde O(dr^frac{3}{2}T^frac{1+delta}{1+2delta})$ regret bound, and it is efficient under the high-dimensional scenario. We also conduct simulations to demonstrate the practical superiority of our algorithm.

4/30/2024

Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery

Yassir Jedra, William R'eveillard, Stefan Stojanovic, Alexandre Proutiere

We study contextual bandits with low-rank structure where, in each round, if the (context, arm) pair $(i,j)in [m]times [n]$ is selected, the learner observes a noisy sample of the $(i,j)$-th entry of an unknown low-rank reward matrix. Successive contexts are generated randomly in an i.i.d. manner and are revealed to the learner. For such bandits, we present efficient algorithms for policy evaluation, best policy identification and regret minimization. For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal. For instance, the number of samples required to return an $varepsilon$-optimal policy with probability at least $1-delta$ typically scales as ${r(m+n)over varepsilon^2}log(1/delta)$. Our regret minimization algorithm enjoys minimax guarantees typically scaling as $r^{7/4}(m+n)^{3/4}sqrt{T}$, which improves over existing algorithms. All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix. We show that these estimates enjoy tight error guarantees in the two-to-infinity norm. This in turn allows us to reformulate our problems as a misspecified linear bandit problem with dimension roughly $r(m+n)$ and misspecification controlled by the subspace recovery error, as well as to design the second phase of our algorithms efficiently.

7/8/2024

💬

Symmetric Linear Bandits with Hidden Symmetry

Nam Phuong Tran, The Anh Ta, Debmalya Mandal, Long Tran-Thanh

High-dimensional linear bandits with low-dimensional structure have received considerable attention in recent studies due to their practical significance. The most common structure in the literature is sparsity. However, it may not be available in practice. Symmetry, where the reward is invariant under certain groups of transformations on the set of arms, is another important inductive bias in the high-dimensional case that covers many standard structures, including sparsity. In this work, we study high-dimensional symmetric linear bandits where the symmetry is hidden from the learner, and the correct symmetry needs to be learned in an online setting. We examine the structure of a collection of hidden symmetry and provide a method based on model selection within the collection of low-dimensional subspaces. Our algorithm achieves a regret bound of $ O(d_0^{1/3} T^{2/3} log(d))$, where $d$ is the ambient dimension which is potentially very large, and $d_0$ is the dimension of the true low-dimensional subspace such that $d_0 ll d$. With an extra assumption on well-separated models, we can further improve the regret to $ O(d_0sqrt{Tlog(d)} )$.

5/24/2024

🧠

Restless Linear Bandits

Azadeh Khaleghi

A more general formulation of the linear bandit problem is considered to allow for dependencies over time. Specifically, it is assumed that there exists an unknown $mathbb{R}^d$-valued stationary $varphi$-mixing sequence of parameters $(theta_t,~t in mathbb{N})$ which gives rise to pay-offs. This instance of the problem can be viewed as a generalization of both the classical linear bandits with iid noise, and the finite-armed restless bandits. In light of the well-known computational hardness of optimal policies for restless bandits, an approximation is proposed whose error is shown to be controlled by the $varphi$-dependence between consecutive $theta_t$. An optimistic algorithm, called LinMix-UCB, is proposed for the case where $theta_t$ has an exponential mixing rate. The proposed algorithm is shown to incur a sub-linear regret of $mathcal{O}left(sqrt{d nmathrm{polylog}(n) }right)$ with respect to an oracle that always plays a multiple of $mathbb{E}theta_t$. The main challenge in this setting is to ensure that the exploration-exploitation strategy is robust against long-range dependencies. The proposed method relies on Berbee's coupling lemma to carefully select near-independent samples and construct confidence ellipsoids around empirical estimates of $mathbb{E}theta_t$.

5/20/2024