Approximate Linear Programming for Decentralized Policy Iteration in Cooperative Multi-agent Markov Decision Processes

2311.11789

Published 5/1/2024 by Lakshmi Mandal, Chandrashekar Lakshminarayanan, Shalabh Bhatnagar

🗣️

Abstract

In this work, we consider a cooperative multi-agent Markov decision process (MDP) involving m agents. At each decision epoch, all the m agents independently select actions in order to maximize a common long-term objective. In the policy iteration process of multi-agent setup, the number of actions grows exponentially with the number of agents, incurring huge computational costs. Thus, recent works consider decentralized policy improvement, where each agent improves its decisions unilaterally, assuming that the decisions of the other agents are fixed. However, exact value functions are considered in the literature, which is computationally expensive for a large number of agents with high dimensional state-action space. Thus, we propose approximate decentralized policy iteration algorithms, using approximate linear programming with function approximation to compute the approximate value function for decentralized policy improvement. Further, we consider (both) cooperative multi-agent finite and infinite horizon discounted MDPs and propose suitable algorithms in each case. Moreover, we provide theoretical guarantees for our algorithms and also demonstrate their advantages over existing state-of-the-art algorithms in the literature.

Create account to get full access

Overview

This paper proposes approximate decentralized policy iteration algorithms for cooperative multi-agent Markov decision processes (MDPs).
The algorithms use approximate linear programming with function approximation to compute the approximate value function for decentralized policy improvement.
The paper considers both cooperative multi-agent finite and infinite horizon discounted MDPs and provides theoretical guarantees for the proposed algorithms.
The algorithms are demonstrated to have advantages over existing state-of-the-art algorithms in the literature.

Plain English Explanation

In a cooperative multi-agent scenario, multiple agents work together to achieve a common long-term objective. When using a policy iteration process to find the optimal actions for each agent, the number of possible actions grows exponentially as the number of agents increases, leading to high computational costs.

To address this issue, the paper proposes approximate decentralized policy iteration algorithms that allow each agent to improve its decisions independently, assuming the decisions of the other agents are fixed. This decentralized approach helps reduce the computational complexity compared to considering the exact value functions for all agents.

The key idea is to use approximate linear programming with function approximation to compute the approximate value function for each agent's policy improvement. This allows the algorithms to be applied to cooperative multi-agent problems with large state-action spaces, which is often the case in real-world applications.

The paper considers both finite and infinite horizon discounted cooperative multi-agent MDPs and provides suitable algorithms for each case. Additionally, the authors provide theoretical guarantees for the performance of their algorithms and demonstrate their advantages over existing state-of-the-art techniques.

Technical Explanation

The paper considers a cooperative multi-agent Markov decision process (MDP) involving

agents. At each decision epoch, all the agents independently select actions to maximize a common long-term objective. In the policy iteration process of a multi-agent setup, the number of actions grows exponentially with the number of agents, leading to high computational costs.

To address this issue, the authors propose approximate decentralized policy iteration algorithms, where each agent improves its decisions unilaterally, assuming the decisions of the other agents are fixed. This decentralized approach helps reduce the computational complexity compared to considering the exact value functions for all agents.

The key contribution of this work is the use of approximate linear programming with function approximation to compute the approximate value function for decentralized policy improvement. This approach allows the algorithms to be applied to cooperative multi-agent problems with large state-action spaces, which is often the case in real-world applications.

The paper considers both cooperative multi-agent finite and infinite horizon discounted MDPs and proposes suitable algorithms for each case. The authors provide theoretical guarantees for the performance of their algorithms and demonstrate their advantages over existing state-of-the-art techniques, such as Nonstationary Reinforcement Learning with Linear Function Approximation, Sample-Efficient Learning for Infinite Horizon Average Reward, Distributed Multi-Agent Reinforcement Learning Based on Graph Neural Networks, Solving Long-Run Average Reward Robust MDPs, and Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning.

Critical Analysis

The paper presents a novel approach to tackling the computational challenges of cooperative multi-agent MDPs by using approximate decentralized policy iteration algorithms. The authors' use of approximate linear programming with function approximation is a promising strategy to handle large state-action spaces and reduce the exponential growth of the action space as the number of agents increases.

However, the paper does not discuss the potential limitations of the proposed algorithms, such as the accuracy of the function approximation or the impact of the assumption that the decisions of the other agents are fixed. Additionally, the authors do not explore the scalability of their approach to very large-scale multi-agent systems or the potential for further optimization and parallelization of the algorithms.

Furthermore, the paper does not provide a comprehensive comparison with other state-of-the-art approaches beyond the specific algorithms mentioned. Exploring the performance of the proposed algorithms in a wider range of cooperative multi-agent scenarios and against a broader set of existing techniques could help better evaluate their strengths and weaknesses.

Conclusion

This paper presents a novel approach to cooperative multi-agent MDPs by proposing approximate decentralized policy iteration algorithms that use approximate linear programming with function approximation. The algorithms can effectively handle large state-action spaces and reduce the computational complexity of the policy iteration process, making them suitable for real-world applications with many agents.

The theoretical guarantees and the demonstrated advantages over existing state-of-the-art algorithms suggest that the proposed approach is a promising direction for further research and development in the field of cooperative multi-agent reinforcement learning. Future work could explore the scalability of the algorithms, investigate the impact of relaxing the fixed-decision assumption, and conduct more comprehensive comparisons with a wider range of existing techniques.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🎯

Refined Sample Complexity for Markov Games with Independent Linear Function Approximation

Yan Dai, Qiwen Cui, Simon S. Du

Markov Games (MG) is an important model for Multi-Agent Reinforcement Learning (MARL). It was long believed that the curse of multi-agents (i.e., the algorithmic performance drops exponentially with the number of agents) is unavoidable until several recent works (Daskalakis et al., 2023; Cui et al., 2023; Wang et al., 2023). While these works resolved the curse of multi-agents, when the state spaces are prohibitively large and (linear) function approximations are deployed, they either had a slower convergence rate of $O(T^{-1/4})$ or brought a polynomial dependency on the number of actions $A_{max}$ -- which is avoidable in single-agent cases even when the loss functions can arbitrarily vary with time. This paper first refines the AVLPR framework by Wang et al. (2023), with an insight of designing *data-dependent* (i.e., stochastic) pessimistic estimation of the sub-optimality gap, allowing a broader choice of plug-in algorithms. When specialized to MGs with independent linear function approximations, we propose novel *action-dependent bonuses* to cover occasionally extreme estimation errors. With the help of state-of-the-art techniques from the single-agent RL literature, we give the first algorithm that tackles the curse of multi-agents, attains the optimal $O(T^{-1/2})$ convergence rate, and avoids $text{poly}(A_{max})$ dependency simultaneously.

6/12/2024

cs.LG cs.GT stat.ML

Locally Interdependent Multi-Agent MDP: Theoretical Framework for Decentralized Agents with Dynamic Dependencies

Alex DeWeese, Guannan Qu

Many multi-agent systems in practice are decentralized and have dynamically varying dependencies. There has been a lack of attempts in the literature to analyze these systems theoretically. In this paper, we propose and theoretically analyze a decentralized model with dynamically varying dependencies called the Locally Interdependent Multi-Agent MDP. This model can represent problems in many disparate domains such as cooperative navigation, obstacle avoidance, and formation control. Despite the intractability that general partially observable multi-agent systems suffer from, we propose three closed-form policies that are theoretically near-optimal in this setting and can be scalable to compute and store. Consequentially, we reveal a fundamental property of Locally Interdependent Multi-Agent MDP's that the partially observable decentralized solution is exponentially close to the fully observable solution with respect to the visibility radius. We then discuss extensions of our closed-form policies to further improve tractability. We conclude by providing simulations to investigate some long horizon behaviors of our closed-form policies.

6/12/2024

cs.LG cs.AI cs.MA

📶

High-probability sample complexities for policy evaluation with linear function approximation

Gen Li, Weichen Wu, Yuejie Chi, Cong Ma, Alessandro Rinaldo, Yuting Wei

This paper is concerned with the problem of policy evaluation with linear function approximation in discounted infinite horizon Markov decision processes. We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms: the temporal difference (TD) learning algorithm and the two-timescale linear TD with gradient correction (TDC) algorithm. In both the on-policy setting, where observations are generated from the target policy, and the off-policy setting, where samples are drawn from a behavior policy potentially different from the target policy, we establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level. We also exhihit an explicit dependence on problem-related quantities, and show in the on-policy setting that our upper bound matches the minimax lower bound on crucial problem parameters, including the choice of the feature maps and the problem dimension.

5/3/2024

stat.ML cs.IT cs.LG

Approximate Global Convergence of Independent Learning in Multi-Agent Systems

Ruiyang Jin, Zaiwei Chen, Yiheng Lin, Jie Song, Adam Wierman

Independent learning (IL), despite being a popular approach in practice to achieve scalability in large-scale multi-agent systems, usually lacks global convergence guarantees. In this paper, we study two representative algorithms, independent $Q$-learning and independent natural actor-critic, within value-based and policy-based frameworks, and provide the first finite-sample analysis for approximate global convergence. The results imply a sample complexity of $tilde{mathcal{O}}(epsilon^{-2})$ up to an error term that captures the dependence among agents and characterizes the fundamental limit of IL in achieving global convergence. To establish the result, we develop a novel approach for analyzing IL by constructing a separable Markov decision process (MDP) for convergence analysis and then bounding the gap due to model difference between the separable MDP and the original one. Moreover, we conduct numerical experiments using a synthetic MDP and an electric vehicle charging example to verify our theoretical findings and to demonstrate the practical applicability of IL.

5/31/2024

cs.LG