Optimistic Q-learning for average reward and episodic reinforcement learning

Priyank Agrawal, Shipra Agrawal

We present an optimistic Q-learning algorithm for regret minimization in average reward reinforcement learning under an additional assumption on the underlying MDP that for all policies, the expected time to visit some frequent state $s_0$ is finite and upper bounded by $H$. Our setting strictly generalizes the episodic setting and is significantly less restrictive than the assumption of bounded hitting time {it for all states} made by most previous literature on model-free algorithms in average reward settings. We demonstrate a regret bound of $tilde{O}(H^5 Ssqrt{AT})$, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A key technical novelty of our work is to introduce an $overline{L}$ operator defined as $overline{L} v = frac{1}{H} sum_{h=1}^H L^h v$ where $L$ denotes the Bellman operator. We show that under the given assumption, the $overline{L}$ operator has a strict contraction (in span) even in the average reward setting. Our algorithm design then uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Therefore, we provide a unified view of regret minimization in episodic and non-episodic settings that may be of independent interest.

7/19/2024

🏅

Provably Efficient Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs

Kihyuk Hong, Yufan Zhang, Ambuj Tewari

We resolve the open problem of designing a computationally efficient algorithm for infinite-horizon average-reward linear Markov Decision Processes (MDPs) with $widetilde{O}(sqrt{T})$ regret. Previous approaches with $widetilde{O}(sqrt{T})$ regret either suffer from computational inefficiency or require strong assumptions on dynamics, such as ergodicity. In this paper, we approximate the average-reward setting by the discounted setting and show that running an optimistic value iteration-based algorithm for learning the discounted setting achieves $widetilde{O}(sqrt{T})$ regret when the discounting factor $gamma$ is tuned appropriately. The challenge in the approximation approach is to get a regret bound with a sharp dependency on the effective horizon $1 / (1 - gamma)$. We use a computationally efficient clipping operator that constrains the span of the optimistic state value function estimate to achieve a sharp regret bound in terms of the effective horizon, which leads to $widetilde{O}(sqrt{T})$ regret.

5/27/2024

👨‍🏫

Quantum Speedups in Regret Analysis of Infinite Horizon Average-Reward Markov Decision Processes

Bhargav Ganguly, Yang Xu, Vaneet Aggarwal

This paper investigates the potential of quantum acceleration in addressing infinite horizon Markov Decision Processes (MDPs) to enhance average reward outcomes. We introduce an innovative quantum framework for the agent's engagement with an unknown MDP, extending the conventional interaction paradigm. Our approach involves the design of an optimism-driven tabular Reinforcement Learning algorithm that harnesses quantum signals acquired by the agent through efficient quantum mean estimation techniques. Through thorough theoretical analysis, we demonstrate that the quantum advantage in mean estimation leads to exponential advancements in regret guarantees for infinite horizon Reinforcement Learning. Specifically, the proposed Quantum algorithm achieves a regret bound of $tilde{mathcal{O}}(1)$, a significant improvement over the $tilde{mathcal{O}}(sqrt{T})$ bound exhibited by classical counterparts.

4/30/2024

🏅

Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation

Woojin Chae, Dabeen Lee

This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition. While guaranteeing computational efficiency, our algorithm for linear MDPs achieves the best-known regret upper bound of $widetilde{mathcal{O}}(d^{3/2}mathrm{sp}(v^*)sqrt{T})$ over $T$ time steps where $mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$ is the dimension of the feature mapping. For linear mixture MDPs, our algorithm attains a regret bound of $widetilde{mathcal{O}}(dcdotmathrm{sp}(v^*)sqrt{T})$. The algorithm applies novel techniques to control the covering number of the value function class and the span of optimistic estimators of the value function, which is of independent interest.

9/25/2024

Optimistic Q-learning for average reward and episodic reinforcement learning

Overview

Plain English Explanation

Technical Explanation

Critical Analysis

Conclusion

Related Papers