Linear Contextual Bandits with Hybrid Payoff: Revisited

Read original: arXiv:2406.10131 - Published 9/5/2024 by Nirjhar Das, Gaurav Sinha
Total Score

0

Linear Contextual Bandits with Hybrid Payoff: Revisited

Sign in to get full access

or

If you already have an account, we'll log you in



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Linear Contextual Bandits with Hybrid Payoff: Revisited
Total Score

0

Linear Contextual Bandits with Hybrid Payoff: Revisited

Nirjhar Das, Gaurav Sinha

We study the Linear Contextual Bandit problem in the hybrid reward setting. In this setting every arm's reward model contains arm specific parameters in addition to parameters shared across the reward models of all the arms. We can reduce this setting to two closely related settings (a) Shared - no arm specific parameters, and (b) Disjoint - only arm specific parameters, enabling the application of two popular state of the art algorithms - $texttt{LinUCB}$ and $texttt{DisLinUCB}$ (Algorithm 1 in (Li et al. 2010)). When the arm features are stochastic and satisfy a popular diversity condition, we provide new regret analyses for both algorithms, significantly improving on the known regret guarantees of these algorithms. Our novel analysis critically exploits the hybrid reward structure and the diversity condition. Moreover, we introduce a new algorithm $texttt{HyLinUCB}$ that crucially modifies $texttt{LinUCB}$ (using a new exploration coefficient) to account for sparsity in the hybrid setting. Under the same diversity assumptions, we prove that $texttt{HyLinUCB}$ also incurs only $O(sqrt{T})$ regret for $T$ rounds. We perform extensive experiments on synthetic and real-world datasets demonstrating strong empirical performance of $texttt{HyLinUCB}$.For number of arm specific parameters much larger than the number of shared parameters, we observe that $texttt{DisLinUCB}$ incurs the lowest regret. In this case, regret of $texttt{HyLinUCB}$ is the second best and extremely competitive to $texttt{DisLinUCB}$. In all other situations, including our real-world dataset, $texttt{HyLinUCB}$ has significantly lower regret than $texttt{LinUCB}$, $texttt{DisLinUCB}$ and other SOTA baselines we considered. We also empirically observe that the regret of $texttt{HyLinUCB}$ grows much slower with the number of arms compared to baselines, making it suitable even for very large action spaces.

Read more

9/5/2024

Contextual Bandits for Unbounded Context Distributions
Total Score

0

Contextual Bandits for Unbounded Context Distributions

Puning Zhao, Jiafei Wu, Zhe Liu, Huiwen Wu

Nonparametric contextual bandit is an important model of sequential decision making problems. Under $alpha$-Tsybakov margin condition, existing research has established a regret bound of $tilde{O}left(T^{1-frac{alpha+1}{d+2}}right)$ for bounded supports. However, the optimal regret with unbounded contexts has not been analyzed. The challenge of solving contextual bandit problems with unbounded support is to achieve both exploration-exploitation tradeoff and bias-variance tradeoff simultaneously. In this paper, we solve the nonparametric contextual bandit problem with unbounded contexts. We propose two nearest neighbor methods combined with UCB exploration. The first method uses a fixed $k$. Our analysis shows that this method achieves minimax optimal regret under a weak margin condition and relatively light-tailed context distributions. The second method uses adaptive $k$. By a proper data-driven selection of $k$, this method achieves an expected regret of $tilde{O}left(T^{1-frac{(alpha+1)beta}{alpha+(d+2)beta}}+T^{1-beta}right)$, in which $beta$ is a parameter describing the tail strength. This bound matches the minimax lower bound up to logarithm factors, indicating that the second method is approximately optimal.

Read more

8/20/2024

A Contextual Combinatorial Bandit Approach to Negotiation
Total Score

0

A Contextual Combinatorial Bandit Approach to Negotiation

Yexin Li, Zhancun Mu, Siyuan Qi

Learning effective negotiation strategies poses two key challenges: the exploration-exploitation dilemma and dealing with large action spaces. However, there is an absence of learning-based approaches that effectively address these challenges in negotiation. This paper introduces a comprehensive formulation to tackle various negotiation problems. Our approach leverages contextual combinatorial multi-armed bandits, with the bandits resolving the exploration-exploitation dilemma, and the combinatorial nature handles large action spaces. Building upon this formulation, we introduce NegUCB, a novel method that also handles common issues such as partial observations and complex reward functions in negotiation. NegUCB is contextual and tailored for full-bandit feedback without constraints on the reward functions. Under mild assumptions, it ensures a sub-linear regret upper bound. Experiments conducted on three negotiation tasks demonstrate the superiority of our approach.

Read more

7/2/2024

Causal Contextual Bandits with Adaptive Context
Total Score

0

Causal Contextual Bandits with Adaptive Context

Rahul Madhavan, Aurghya Maiti, Gaurav Sinha, Siddharth Barman

We study a variant of causal contextual bandits where the context is chosen based on an initial intervention chosen by the learner. At the beginning of each round, the learner selects an initial action, depending on which a stochastic context is revealed by the environment. Following this, the learner then selects a final action and receives a reward. Given $T$ rounds of interactions with the environment, the objective of the learner is to learn a policy (of selecting the initial and the final action) with maximum expected reward. In this paper we study the specific situation where every action corresponds to intervening on a node in some known causal graph. We extend prior work from the deterministic context setting to obtain simple regret minimization guarantees. This is achieved through an instance-dependent causal parameter, $lambda$, which characterizes our upper bound. Furthermore, we prove that our simple regret is essentially tight for a large class of instances. A key feature of our work is that we use convex optimization to address the bandit exploration problem. We also conduct experiments to validate our theoretical results, and release our code at our project GitHub repository: https://github.com/adaptiveContextualCausalBandits/aCCB.

Read more

6/4/2024