Bandits with Preference Feedback: A Stackelberg Game Perspective

2406.16745

Published 6/26/2024 by Barna P'asztor, Parnian Kassraie, Andreas Krause

Bandits with Preference Feedback: A Stackelberg Game Perspective

Abstract

Bandits with preference feedback present a powerful tool for optimizing unknown target functions when only pairwise comparisons are allowed instead of direct value queries. This model allows for incorporating human feedback into online inference and optimization and has been employed in systems for fine-tuning large language models. The problem is well understood in simplified settings with linear target functions or over finite small domains that limit practical interest. Taking the next step, we consider infinite domains and nonlinear (kernelized) rewards. In this setting, selecting a pair of actions is quite challenging and requires balancing exploration and exploitation at two levels: within the pair, and along the iterations of the algorithm. We propose MAXMINLCB, which emulates this trade-off as a zero-sum Stackelberg game, and chooses action pairs that are informative and yield favorable rewards. MAXMINLCB consistently outperforms existing algorithms and satisfies an anytime-valid rate-optimal regret guarantee. This is due to our novel preference-based confidence sequences for kernelized logistic estimators.

Create account to get full access

Overview

This paper introduces a new problem formulation called "Bandits with Preference Feedback", where a leader (e.g., a recommender system) interacts with a follower (e.g., a user) in a Stackelberg game setting.
The leader selects an action (e.g., a recommendation) and the follower provides preference feedback (e.g., likes/dislikes) on the chosen action.
The goal is for the leader to learn the follower's preferences over time and make optimal recommendations that maximize their own utility.
The authors propose several algorithms to solve this problem and provide theoretical regret bounds.

Plain English Explanation

In this paper, the authors consider a scenario where there is a "leader" (e.g., a recommender system) and a "follower" (e.g., a user). The leader wants to recommend things to the follower, but the follower has their own preferences that the leader doesn't know about.

The way this works is, the leader makes a recommendation, and the follower provides some feedback on whether they liked or disliked the recommendation. Over time, the leader can use this feedback to learn the follower's preferences and make better recommendations in the future.

The authors propose several different algorithms that the leader can use to learn the follower's preferences and make good recommendations. They also provide mathematical analysis to show that these algorithms perform well and don't make too many mistakes.

This problem setup is interesting because it captures the interactive nature of many real-world recommendation systems, where the system makes suggestions and the user provides feedback, allowing the system to learn and improve over time. The Strategic Linear Contextual Bandits paper also explores a related problem in a Stackelberg game setting.

Technical Explanation

The authors formulate the "Bandits with Preference Feedback" problem, where a leader (e.g., a recommender system) interacts with a follower (e.g., a user) in a Stackelberg game setting. At each round, the leader selects an action (e.g., a recommendation) and the follower provides preference feedback (e.g., likes/dislikes) on the chosen action.

The leader's goal is to learn the follower's preferences over time and make optimal recommendations that maximize their own utility. The authors propose several algorithms to solve this problem, including:

Leader-Follower-Bandit (LFB): The leader maintains a model of the follower's preferences and selects actions to maximize their own utility, while the follower best-responds to the leader's actions.
Leader-Follower-UCB (LF-UCB): An Upper Confidence Bound (UCB) algorithm for the leader that explores the follower's preferences and exploits the learned model.
Leader-Follower-Thompson-Sampling (LF-TS): A Thompson Sampling algorithm for the leader that maintains a posterior distribution over the follower's preferences.

The authors provide theoretical regret bounds for these algorithms, showing that they can achieve near-optimal performance compared to an oracle with full knowledge of the follower's preferences. The Near-Optimal Regret for Linear MDPs in Aggregate Bandit and Leveraging Offline Data for Linear Latent Bandits papers explore related bandit problems with side information and offline data, respectively.

Critical Analysis

The paper presents a novel and interesting problem formulation that captures the interactive nature of many real-world recommendation systems. The authors' proposed algorithms and theoretical analysis provide valuable insights and a solid foundation for further research in this area.

One potential limitation is that the paper assumes a specific model of the follower's preferences, which may not always align with reality. In practice, user preferences can be more complex and may not be well-described by the linear model used in this work. Additionally, the paper assumes the follower is a rational agent who best-responds to the leader's actions, which may not always hold true for human users.

Further research could explore relaxing some of these assumptions, such as considering more flexible preference models or incorporating behavioral insights about human decision-making. The Incentive Compatible Bandits with Importance Weighting paper also explores related issues of incentive compatibility in bandit problems.

Overall, this paper makes a valuable contribution to the field of interactive decision-making and provides a promising starting point for further exploration of Bandits with Preference Feedback.

Conclusion

This paper introduces a new problem formulation called "Bandits with Preference Feedback", where a leader (e.g., a recommender system) interacts with a follower (e.g., a user) in a Stackelberg game setting. The leader selects actions (recommendations) and the follower provides preference feedback, allowing the leader to learn the follower's preferences over time and make optimal recommendations.

The authors propose several algorithms to solve this problem and provide theoretical regret bounds, demonstrating the potential for these techniques to improve the performance of real-world recommendation systems. While the paper makes some simplifying assumptions, it presents an interesting and relevant problem that deserves further exploration, particularly in terms of relaxing model assumptions and incorporating insights from human decision-making research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🏅

Cascading Reinforcement Learning

Yihan Du, R. Srikant, Wei Chen

Cascading bandits have gained popularity in recent years due to their applicability to recommendation systems and online advertising. In the cascading bandit model, at each timestep, an agent recommends an ordered subset of items (called an item list) from a pool of items, each associated with an unknown attraction probability. Then, the user examines the list, and clicks the first attractive item (if any), and after that, the agent receives a reward. The goal of the agent is to maximize the expected cumulative reward. However, the prior literature on cascading bandits ignores the influences of user states (e.g., historical behaviors) on recommendations and the change of states as the session proceeds. Motivated by this fact, we propose a generalized cascading RL framework, which considers the impact of user states and state transition into decisions. In cascading RL, we need to select items not only with large attraction probabilities but also leading to good successor states. This imposes a huge computational challenge due to the combinatorial action space. To tackle this challenge, we delve into the properties of value functions, and design an oracle BestPerm to efficiently find the optimal item list. Equipped with BestPerm, we develop two algorithms CascadingVI and CascadingBPI, which are both computationally-efficient and sample-efficient, and provide near-optimal regret and sample complexity guarantees. Furthermore, we present experiments to show the improved computational and sample efficiencies of our algorithms compared to straightforward adaptations of existing RL algorithms in practice.

4/9/2024

cs.LG

Leveraging Offline Data in Linear Latent Bandits

Chinmaya Kausik, Kevin Tan, Ambuj Tewari

Sequential decision-making domains such as recommender systems, healthcare and education often have unobserved heterogeneity in the population that can be modeled using latent bandits $-$ a framework where an unobserved latent state determines the model for a trajectory. While the latent bandit framework is compelling, the extent of its generality is unclear. We first address this by establishing a de Finetti theorem for decision processes, and show that $textit{every}$ exchangeable and coherent stateless decision process is a latent bandit. The latent bandit framework lends itself particularly well to online learning with offline datasets, a problem of growing interest in sequential decision-making. One can leverage offline latent bandit data to learn a complex model for each latent state, so that an agent can simply learn the latent state online to act optimally. We focus on a linear model for a latent bandit with $d_A$-dimensional actions, where the latent states lie in an unknown $d_K$-dimensional subspace for $d_K ll d_A$. We present SOLD, a novel principled method to learn this subspace from short offline trajectories with guarantees. We then provide two methods to leverage this subspace online: LOCAL-UCB and ProBALL-UCB. We demonstrate that LOCAL-UCB enjoys $tilde O(min(d_Asqrt{T}, d_Ksqrt{T}(1+sqrt{d_AT/d_KN})))$ regret guarantees, where the effective dimension is lower when the size $N$ of the offline dataset is larger. ProBALL-UCB enjoys a slightly weaker guarantee, but is more practical and computationally efficient. Finally, we establish the efficacy of our methods using experiments on both synthetic data and real-life movie recommendation data from MovieLens.

5/28/2024

cs.LG cs.AI stat.ML

🚀

Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback

Asaf Cassel, Haipeng Luo, Aviv Rosenberg, Dmitry Sotnikov

In many real-world applications, it is hard to provide a reward signal in each step of a Reinforcement Learning (RL) process and more natural to give feedback when an episode ends. To this end, we study the recently proposed model of RL with Aggregate Bandit Feedback (RL-ABF), where the agent only observes the sum of rewards at the end of an episode instead of each reward individually. Prior work studied RL-ABF only in tabular settings, where the number of states is assumed to be small. In this paper, we extend ABF to linear function approximation and develop two efficient algorithms with near-optimal regret guarantees: a value-based optimistic algorithm built on a new randomization technique with a Q-functions ensemble, and a policy optimization algorithm that uses a novel hedging scheme over the ensemble.

5/15/2024

cs.LG

🧠

Strategic Linear Contextual Bandits

Thomas Kleine Buening, Aadirupa Saha, Christos Dimitrakakis, Haifeng Xu

Motivated by the phenomenon of strategic agents gaming a recommender system to maximize the number of times they are recommended to users, we study a strategic variant of the linear contextual bandit problem, where the arms can strategically misreport their privately observed contexts to the learner. We treat the algorithm design problem as one of mechanism design under uncertainty and propose the Optimistic Grim Trigger Mechanism (OptGTM) that incentivizes the agents (i.e., arms) to report their contexts truthfully while simultaneously minimizing regret. We also show that failing to account for the strategic nature of the agents results in linear regret. However, a trade-off between mechanism design and regret minimization appears to be unavoidable. More broadly, this work aims to provide insight into the intersection of online learning and mechanism design.

6/4/2024

cs.LG cs.GT