Feel-Good Thompson Sampling for Contextual Dueling Bandits

Read original: arXiv:2404.06013 - Published 4/10/2024 by Xuheng Li, Heyang Zhao, Quanquan Gu

📊

Overview

Extends classic dueling bandits by incorporating contextual information for decision-making and preference learning
Several algorithms based on the upper confidence bound (UCB) have been proposed, but no algorithm based on posterior sampling has been developed in this setting
Proposes a Thompson sampling algorithm, named FGTS.CDB, for linear contextual dueling bandits
Introduces a new "Feel-Good" exploration term tailored for dueling bandits
Shows the algorithm achieves nearly minimax-optimal regret
Evaluates the algorithm on synthetic data, outperforming existing algorithms

Plain English Explanation

In the classic dueling bandits problem, a learner compares two options and receives feedback on which one was preferred. The contextual dueling bandits setting extends this by incorporating additional information, or "context," to help the learner make better decisions.

Several existing algorithms for contextual dueling bandits use the upper confidence bound (UCB) approach, which focuses on exploring options that have high potential reward. However, no algorithm has been developed that uses the Thompson sampling technique, which has shown success in traditional contextual bandit problems.

The researchers propose a new Thompson sampling algorithm called FGTS.CDB for linear contextual dueling bandits. The key innovation is a "Feel-Good" exploration term that helps the algorithm make better comparisons between the two options. This allows FGTS.CDB to achieve near-optimal performance, meaning it can learn to make good decisions quickly, even in complex environments.

The researchers evaluate FGTS.CDB on synthetic data and find that it outperforms existing algorithms by a significant margin. This suggests the Thompson sampling approach, combined with the new exploration term, is a promising direction for tackling contextual dueling bandits problems.

Technical Explanation

The researchers propose a Thompson sampling algorithm, named FGTS.CDB, for the linear contextual dueling bandits setting. At the core of their algorithm is a new "Feel-Good" exploration term that leverages the independence of the two selected arms, thereby avoiding a cross term in the analysis.

The algorithm works as follows:

At each round, the learner receives a context vector.
The learner samples two candidate arms (options) from a posterior distribution.
The learner compares the two sampled arms and receives feedback on which one was preferred.
The learner updates the posterior distribution based on the feedback.

The key innovation is the "Feel-Good" exploration term, which encourages the learner to explore options that are likely to be preferred, given the current context. This term helps the algorithm make better comparisons between the two options, leading to faster learning.

The researchers show that their FGTS.CDB algorithm achieves nearly minimax-optimal regret, meaning it can learn to make good decisions quickly, even in complex environments. They evaluate the algorithm on synthetic data and find that it outperforms existing contextual dueling bandits algorithms by a large margin.

Critical Analysis

The researchers have made an interesting contribution to the field of contextual dueling bandits by proposing a novel Thompson sampling algorithm with a specialized exploration term. The theoretical analysis demonstrating the near-optimal regret guarantee is a significant result.

However, the paper does not address some potential limitations and areas for further research. For example, the algorithm assumes a linear model, which may not be suitable for more complex, non-linear relationships between context and preferences. It would be valuable to explore extensions to more general function classes or neural network-based models.

Additionally, the evaluation is limited to synthetic data, and it would be beneficial to test the algorithm on real-world datasets to understand its performance in practical scenarios. Contextual dueling bandits are an important problem with many applications, such as online advertising, recommendation systems, and user interface design, so validating the algorithm's effectiveness in these domains would be a valuable next step.

Overall, the FGTS.CDB algorithm represents an important advance in the field of contextual dueling bandits, but further research is needed to fully explore its capabilities and limitations.

Conclusion

This paper proposes a novel Thompson sampling algorithm, FGTS.CDB, for the linear contextual dueling bandits problem. The key innovation is a "Feel-Good" exploration term that helps the algorithm make better comparisons between the two options, leading to faster learning and near-optimal regret performance.

The results suggest that the Thompson sampling approach, combined with the new exploration term, is a promising direction for tackling contextual dueling bandits problems. While the algorithm is limited to linear models, the paper lays the groundwork for further research into more flexible and powerful approaches for this important class of decision-making problems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

📊

Feel-Good Thompson Sampling for Contextual Dueling Bandits

Xuheng Li, Heyang Zhao, Quanquan Gu

Contextual dueling bandits, where a learner compares two options based on context and receives feedback indicating which was preferred, extends classic dueling bandits by incorporating contextual information for decision-making and preference learning. Several algorithms based on the upper confidence bound (UCB) have been proposed for linear contextual dueling bandits. However, no algorithm based on posterior sampling has been developed in this setting, despite the empirical success observed in traditional contextual bandits. In this paper, we propose a Thompson sampling algorithm, named FGTS.CDB, for linear contextual dueling bandits. At the core of our algorithm is a new Feel-Good exploration term specifically tailored for dueling bandits. This term leverages the independence of the two selected arms, thereby avoiding a cross term in the analysis. We show that our algorithm achieves nearly minimax-optimal regret, i.e., $tilde{mathcal{O}}(dsqrt T)$, where $d$ is the model dimension and $T$ is the time horizon. Finally, we evaluate our algorithm on synthetic data and observe that FGTS.CDB outperforms existing algorithms by a large margin.

4/10/2024

Neural Dueling Bandits

Arun Verma, Zhongxiang Dai, Xiaoqiang Lin, Patrick Jaillet, Bryan Kian Hsiang Low

Contextual dueling bandit is used to model the bandit problems, where a learner's goal is to find the best arm for a given context using observed noisy preference feedback over the selected arms for the past contexts. However, existing algorithms assume the reward function is linear, which can be complex and non-linear in many real-life applications like online recommendations or ranking web search results. To overcome this challenge, we use a neural network to estimate the reward function using preference feedback for the previously selected arms. We propose upper confidence bound- and Thompson sampling-based algorithms with sub-linear regret guarantees that efficiently select arms in each round. We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution. Experimental results on the problem instances derived from synthetic datasets corroborate our theoretical results.

7/25/2024

Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback

Qiwei Di, Jiafan He, Quanquan Gu

Learning from human feedback plays an important role in aligning generative models, such as large language models (LLM). However, the effectiveness of this approach can be influenced by adversaries, who may intentionally provide misleading preferences to manipulate the output in an undesirable or harmful direction. To tackle this challenge, we study a specific model within this problem domain--contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose an algorithm namely robust contextual dueling bandit (algo), which is based on uncertainty-weighted maximum likelihood estimation. Our algorithm achieves an $tilde O(dsqrt{T}+dC)$ regret bound, where $T$ is the number of rounds, $d$ is the dimension of the context, and $ 0 le C le T$ is the total number of adversarial feedback. We also prove a lower bound to show that our regret bound is nearly optimal, both in scenarios with and without ($C=0$) adversarial feedback. Additionally, we conduct experiments to evaluate our proposed algorithm against various types of adversarial feedback. Experimental results demonstrate its superiority over the state-of-the-art dueling bandit algorithms in the presence of adversarial feedback.

4/17/2024

Contextual Bandit with Herding Effects: Algorithms and Recommendation Applications

Luyue Xu, Liming Wang, Hong Xie, Mingqiang Zhou

Contextual bandits serve as a fundamental algorithmic framework for optimizing recommendation decisions online. Though extensive attention has been paid to tailoring contextual bandits for recommendation applications, the herding effects in user feedback have been ignored. These herding effects bias user feedback toward historical ratings, breaking down the assumption of unbiased feedback inherent in contextual bandits. This paper develops a novel variant of the contextual bandit that is tailored to address the feedback bias caused by the herding effects. A user feedback model is formulated to capture this feedback bias. We design the TS-Conf (Thompson Sampling under Conformity) algorithm, which employs posterior sampling to balance the exploration and exploitation tradeoff. We prove an upper bound for the regret of the algorithm, revealing the impact of herding effects on learning speed. Extensive experiments on datasets demonstrate that TS-Conf outperforms four benchmark algorithms. Analysis reveals that TS-Conf effectively mitigates the negative impact of herding effects, resulting in faster learning and improved recommendation accuracy.

8/29/2024