Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback

2112.02856

Published 4/1/2024 by Wenjia Ba, Tianyi Lin, Jiawei Zhang, Zhengyuan Zhou

✅

Abstract

We consider online no-regret learning in unknown games with bandit feedback, where each player can only observe its reward at each time -- determined by all players' current joint action -- rather than its gradient. We focus on the class of textit{smooth and strongly monotone} games and study optimal no-regret learning therein. Leveraging self-concordant barrier functions, we first construct a new bandit learning algorithm and show that it achieves the single-agent optimal regret of $tilde{Theta}(nsqrt{T})$ under smooth and strongly concave reward functions ($n geq 1$ is the problem dimension). We then show that if each player applies this no-regret learning algorithm in strongly monotone games, the joint action converges in the textit{last iterate} to the unique Nash equilibrium at a rate of $tilde{Theta}(nT^{-1/2})$. Prior to our work, the best-known convergence rate in the same class of games is $tilde{O}(n^{2/3}T^{-1/3})$ (achieved by a different algorithm), thus leaving open the problem of optimal no-regret learning algorithms (since the known lower bound is $Omega(nT^{-1/2})$). Our results thus settle this open problem and contribute to the broad landscape of bandit game-theoretical learning by identifying the first doubly optimal bandit learning algorithm, in that it achieves (up to log factors) both optimal regret in the single-agent learning and optimal last-iterate convergence rate in the multi-agent learning. We also present preliminary numerical results on several application problems to demonstrate the efficacy of our algorithm in terms of iteration count.

Create account to get full access

Overview

This paper explores no-regret learning in unknown games with limited feedback, where players can only observe their own rewards rather than full information about the game.
The researchers focus on a specific class of games called "smooth and strongly monotone" games, and develop a new algorithm that achieves optimal performance in these games.
The algorithm is shown to have two key properties: it achieves the best-known single-agent regret bound, and it leads to convergence of the players' joint actions to the unique Nash equilibrium at an optimal rate.

Plain English Explanation

Imagine you're playing a game with other people, but you can only see your own score after each round - you don't know how the other players are doing. This type of limited information is called "bandit feedback." The researchers in this paper looked at this situation and tried to find the best way for each player to learn and improve their strategy over time, even without full information about the game.

They focused on a special type of game where the payoffs to each player have two key properties: they change smoothly as the players' actions change, and there is a unique best outcome (called the Nash equilibrium) that all the players are trying to reach. In these types of games, the researchers developed a new learning algorithm that each player can use.

This algorithm has two big advantages. First, it allows each individual player to minimize their own "regret" - the difference between their actual score and the best score they could have gotten if they knew the game perfectly. Second, when all the players use this algorithm, the overall joint actions of the players converge to the Nash equilibrium at the fastest possible rate.

Prior to this work, no algorithm was known that achieved both of these optimal properties at the same time. So this new algorithm represents a significant advance in our understanding of how to learn effectively in complex, uncertain environments where players have limited information.

Technical Explanation

The researchers consider a multi-player game setting where each player can only observe their own reward (their "payoff") after taking an action, rather than having full information about the game. They focus on the class of "smooth and strongly monotone" games, which have nice mathematical properties that enable stronger learning and convergence guarantees.

Building on the technique of self-concordant barrier functions, the researchers first construct a new bandit learning algorithm for the single-agent setting. They show that this algorithm achieves the optimal regret bound of Θ̃(n√T) for smooth and strongly concave reward functions, where n is the problem dimension and T is the number of rounds.

They then analyze what happens when each player uses this no-regret learning algorithm in a strongly monotone multi-player game. They prove that the joint actions of the players will converge to the unique Nash equilibrium at a rate of Θ̃(nT^(-1/2)). This improves upon the previously best-known convergence rate of Õ(n^(2/3)T^(-1/3)) for this class of games.

By establishing these two optimal properties - optimal single-agent regret and optimal multi-agent convergence - the researchers solve an open problem and advance the state-of-the-art in bandit game-theoretical learning.

Critical Analysis

The paper makes several important contributions, but also has some limitations that could be explored further. A key strength is the identification of the first "doubly optimal" bandit learning algorithm, which is a significant theoretical advance.

However, the analysis is restricted to the specific class of "smooth and strongly monotone" games. It would be interesting to see how the algorithm performs in more general game settings. Additionally, the paper does not provide a detailed discussion of the practical considerations and potential challenges in implementing this algorithm in real-world applications.

Another area for further research could be the robustness of the algorithm to things like model misspecification, communication delays, or other real-world complications that may arise in multi-agent systems. Exploring these aspects could help bridge the gap between the theoretical guarantees and practical deployments.

Overall, this is a technically impressive piece of work that makes valuable contributions to the field of bandit game-theoretical learning. The new algorithm represents an important step forward, but there remain opportunities to expand the scope and real-world applicability of this line of research.

Conclusion

This paper presents a novel bandit learning algorithm that achieves optimal performance in a specific class of multi-player games with limited feedback. By proving bounds on both individual regret and the convergence rate of the joint actions to the Nash equilibrium, the researchers solve an open problem and advance the state-of-the-art in this area.

The algorithm's ability to deliver these strong theoretical guarantees is a significant accomplishment. While there are still some limitations to address, this work lays the groundwork for further developments in bandit game-theoretical learning, with potential applications in fields like network optimization, resource allocation, and multi-agent decision-making. Overall, this research represents an important contribution that pushes the boundaries of what is possible when players have access to only partial information about the game they are playing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

✨

Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback

Michael I. Jordan, Tianyi Lin, Zhengyuan Zhou

Online gradient descent (OGD) is well known to be doubly optimal under strong convexity or monotonicity assumptions: (1) in the single-agent setting, it achieves an optimal regret of $Theta(log T)$ for strongly convex cost functions; and (2) in the multi-agent setting of strongly monotone games, with each agent employing OGD, we obtain last-iterate convergence of the joint action to a unique Nash equilibrium at an optimal rate of $Theta(frac{1}{T})$. While these finite-time guarantees highlight its merits, OGD has the drawback that it requires knowing the strong convexity/monotonicity parameters. In this paper, we design a fully adaptive OGD algorithm, textsf{AdaOGD}, that does not require a priori knowledge of these parameters. In the single-agent setting, our algorithm achieves $O(log^2(T))$ regret under strong convexity, which is optimal up to a log factor. Further, if each agent employs textsf{AdaOGD} in strongly monotone games, the joint action converges in a last-iterate sense to a unique Nash equilibrium at a rate of $O(frac{log^3 T}{T})$, again optimal up to log factors. We illustrate our algorithms in a learning version of the classical newsvendor problem, where due to lost sales, only (noisy) gradient feedback can be observed. Our results immediately yield the first feasible and near-optimal algorithm for both the single-retailer and multi-retailer settings. We also extend our results to the more general setting of exp-concave cost functions and games, using the online Newton step (ONS) algorithm.

4/1/2024

cs.GT cs.LG

🛠️

Incentive-compatible Bandits: Importance Weighting No More

Julian Zimmert, Teodor V. Marinov

We study the problem of incentive-compatible online learning with bandit feedback. In this class of problems, the experts are self-interested agents who might misrepresent their preferences with the goal of being selected most often. The goal is to devise algorithms which are simultaneously incentive-compatible, that is the experts are incentivised to report their true preferences, and have no regret with respect to the preferences of the best fixed expert in hindsight. citet{freeman2020no} propose an algorithm in the full information setting with optimal $O(sqrt{T log(K)})$ regret and $O(T^{2/3}(Klog(K))^{1/3})$ regret in the bandit setting. In this work we propose the first incentive-compatible algorithms that enjoy $O(sqrt{KT})$ regret bounds. We further demonstrate how simple loss-biasing allows the algorithm proposed in Freeman et al. 2020 to enjoy $tilde O(sqrt{KT})$ regret. As a byproduct of our approach we obtain the first bandit algorithm with nearly optimal regret bounds in the adversarial setting which works entirely on the observed loss sequence without the need for importance-weighted estimators. Finally, we provide an incentive-compatible algorithm that enjoys asymptotically optimal best-of-both-worlds regret guarantees, i.e., logarithmic regret in the stochastic regime as well as worst-case $O(sqrt{KT})$ regret.

5/13/2024

cs.LG cs.GT

🤷

Convergence to Nash Equilibrium and No-regret Guarantee in (Markov) Potential Games

Jing Dong, Baoxiang Wang, Yaoliang Yu

In this work, we study potential games and Markov potential games under stochastic cost and bandit feedback. We propose a variant of the Frank-Wolfe algorithm with sufficient exploration and recursive gradient estimation, which provably converges to the Nash equilibrium while attaining sublinear regret for each individual player. Our algorithm simultaneously achieves a Nash regret and a regret bound of $O(T^{4/5})$ for potential games, which matches the best available result, without using additional projection steps. Through carefully balancing the reuse of past samples and exploration of new samples, we then extend the results to Markov potential games and improve the best available Nash regret from $O(T^{5/6})$ to $O(T^{4/5})$. Moreover, our algorithm requires no knowledge of the game, such as the distribution mismatch coefficient, which provides more flexibility in its practical implementation. Experimental results corroborate our theoretical findings and underscore the practical effectiveness of our method.

4/11/2024

cs.GT cs.LG

Distributed No-Regret Learning for Multi-Stage Systems with End-to-End Bandit Feedback

I-Hong Hou

This paper studies multi-stage systems with end-to-end bandit feedback. In such systems, each job needs to go through multiple stages, each managed by a different agent, before generating an outcome. Each agent can only control its own action and learn the final outcome of the job. It has neither knowledge nor control on actions taken by agents in the next stage. The goal of this paper is to develop distributed online learning algorithms that achieve sublinear regret in adversarial environments. The setting of this paper significantly expands the traditional multi-armed bandit problem, which considers only one agent and one stage. In addition to the exploration-exploitation dilemma in the traditional multi-armed bandit problem, we show that the consideration of multiple stages introduces a third component, education, where an agent needs to choose its actions to facilitate the learning of agents in the next stage. To solve this newly introduced exploration-exploitation-education trilemma, we propose a simple distributed online learning algorithm, $epsilon-$EXP3. We theoretically prove that the $epsilon-$EXP3 algorithm is a no-regret policy that achieves sublinear regret. Simulation results show that the $epsilon-$EXP3 algorithm significantly outperforms existing no-regret online learning algorithms for the traditional multi-armed bandit problem.

4/9/2024

cs.LG cs.NI