UCB-driven Utility Function Search for Multi-objective Reinforcement Learning

arXiv:2405.00410

Published 5/17/2024 by Yucheng Shi, Alexandros Agapitos, David Lynch, Giorgio Cruciata, Cengis Hasan, Hao Wang, Yayu Yao, Aleksandar Milenovic

cs.LG

Abstract

In Multi-objective Reinforcement Learning (MORL), agents are tasked with optimising decision-making behaviours that trade off between multiple, possibly conflicting, objectives. MORL based on decomposition is a family of solution methods that employ a number of utility functions to decompose the multi-objective problem into individual single-objective problems solved simultaneously in order to approximate a Pareto front of policies. We focus on the case of linear utility functions parameterised by weight vectors w. We introduce a method based on Upper Confidence Bound to efficiently search for the most promising weight vectors during different stages of the learning process, with the aim of maximising the hypervolume of the resulting Pareto front. The proposed method is shown to outperform various MORL baselines on Mujoco benchmark problems across different random seeds. The code is online at: https://github.com/SYCAMORE-1/ucb-MOPPO.

Overview

This paper proposes a new approach for multi-objective reinforcement learning called UCB-driven Utility Function Search (UCBUS).
The key idea is to use an Upper Confidence Bound (UCB) rule to search efficiently for the most promising weight vectors of linear utility functions at different stages of learning, with the aim of maximising the hypervolume of the resulting Pareto front.
The authors evaluate UCBUS on Mujoco continuous control benchmarks and report that it outperforms various MORL baselines across different random seeds.

Plain English Explanation

In many real-world problems, several goals must be balanced at once. For example, when designing a robot, you might want it to be strong, fast, and energy-efficient. Reinforcement learning can train agents to optimize such objectives, but when the objectives conflict, finding the right balance is challenging.

The UCBUS method proposed in this paper addresses this challenge. It uses the UCB algorithm, which is commonly used in multi-armed bandit problems, to decide which utility functions are most worth pursuing. Each utility function is a weighted sum of the objectives, and its weight vector determines how the agent balances them.

The authors test UCBUS on Mujoco continuous control benchmarks. They report that it outperforms various multi-objective RL baselines across different random seeds, which suggests it is a promising approach for problems with multiple, competing objectives.

Technical Explanation

The core idea behind UCBUS is to use the UCB algorithm to search efficiently over the linear utility functions, each parameterised by a weight vector w, that decompose the multi-objective problem into single-objective problems. Specifically, the agent maintains a set of candidate utility functions and, at each step, selects the one with the highest UCB value to evaluate.
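
To make the "set of candidate utility functions" concrete, here is a minimal Python sketch. It assumes linear utilities whose weight vectors lie on an evenly spaced grid over the simplex. The function names (`candidate_weights`, `linear_utility`) and the grid construction are illustrative assumptions, not taken from the paper or its repository.

```python
from itertools import product

import numpy as np


def candidate_weights(num_objectives: int, steps: int) -> np.ndarray:
    """Return all weight vectors on a regular grid whose entries sum to 1.

    Hypothetical helper: each entry is a multiple of 1/steps.
    """
    combos = [
        c for c in product(range(steps + 1), repeat=num_objectives)
        if sum(c) == steps
    ]
    return np.array(combos, dtype=float) / steps


def linear_utility(w: np.ndarray, reward_vector: np.ndarray) -> float:
    """Scalarise a vector-valued reward with the linear utility w . r."""
    return float(np.dot(w, reward_vector))


# Five candidate trade-offs between two objectives.
weights = candidate_weights(num_objectives=2, steps=4)
# The same vector reward is scored differently under each weight vector.
scalar_rewards = [linear_utility(w, np.array([4.0, 8.0])) for w in weights]
```

Each row of `weights` defines one single-objective problem; a finer grid gives more candidate trade-offs at the cost of more problems to solve.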

The UCB value combines the historical performance of each utility function with an exploration term that encourages the agent to try promising but underexplored ones. Over time, the search concentrates on the weight vectors that contribute most to the hypervolume of the resulting Pareto front.
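
The trade-off between historical performance and exploration can be sketched with the textbook UCB1 rule from the multi-armed bandit literature. This is a generic illustration, not the paper's exact acquisition function: treating each candidate weight vector as an arm, the exploration constant `c`, and the idea that the observed reward is some measure of improvement (for example, hypervolume gain) are all assumptions.

```python
import math


def ucb_scores(means, counts, c=math.sqrt(2)):
    """UCB1 score per arm: mean + c * sqrt(ln(total pulls) / pulls of arm)."""
    total = sum(counts)
    scores = []
    for mean, n in zip(means, counts):
        if n == 0:
            # Untried arms get an infinite score so they are pulled first.
            scores.append(math.inf)
        else:
            scores.append(mean + c * math.sqrt(math.log(total) / n))
    return scores


def select_arm(means, counts, c=math.sqrt(2)):
    """Index of the arm (candidate weight vector) with the highest UCB score."""
    scores = ucb_scores(means, counts, c)
    return max(range(len(scores)), key=scores.__getitem__)


def update(means, counts, arm, observed):
    """Fold a new observation into the running mean of the chosen arm."""
    counts[arm] += 1
    means[arm] += (observed - means[arm]) / counts[arm]
```

An arm that has been pulled rarely keeps a large exploration bonus, so it can be selected even when its running mean is no better than that of a well-explored arm.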

The authors evaluate UCBUS on Mujoco continuous control benchmarks and compare it with various MORL baselines across different random seeds. The reported results show UCBUS outperforming these baselines, which suggests that UCB-driven search is an effective way to navigate the trade-offs in multi-objective problems.
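
The abstract names the hypervolume of the Pareto front as the quantity the search aims to maximise. As a rough illustration of that metric, the sketch below computes hypervolume for two objectives under maximisation. The function name, the two-objective restriction, and the choice of reference point are assumptions made for this example; the paper's evaluation code may differ.

```python
def hypervolume_2d(points, reference):
    """Area dominated by `points` and bounded below by `reference`.

    Both objectives are maximised. Points that do not strictly exceed
    the reference point in both objectives contribute nothing.
    """
    ref_x, ref_y = reference
    pts = [(x, y) for x, y in points if x > ref_x and y > ref_y]
    # Sweep from the largest first objective to the smallest, adding the
    # horizontal strip each point contributes above the best y seen so far.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref_y
    for x, y in pts:
        if y > best_y:
            area += (x - ref_x) * (y - best_y)
            best_y = y
    return area


# Three mutually non-dominated returns plus one dominated point.
front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0), (1.0, 1.0)]
area = hypervolume_2d(front, reference=(0.0, 0.0))
```

A larger hypervolume means the set of policies covers more of the objective space above the reference point, so adding a dominated point leaves the value unchanged.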

Critical Analysis

The UCBUS approach presented in this paper is a promising contribution to multi-objective reinforcement learning. The authors demonstrate its effectiveness on Mujoco benchmark tasks against several baselines.

However, the paper does not address some potential limitations of the method. For example, the performance of UCBUS may depend heavily on the choice of candidate utility functions, and it is not clear how to ensure that the candidate set contains weight vectors whose policies approximate the true Pareto front well. Additionally, the computational cost of maintaining and evaluating UCB values for a large set of utility functions could be prohibitive in some real-world applications.

Further research is needed to address these limitations and explore the broader applicability of UCBUS. Potential areas for improvement include investigating more efficient ways to represent and search the utility function space, as well as developing methods to automatically generate candidate utility functions based on the problem structure.

Conclusion

The UCBUS approach proposed in this paper uses the UCB algorithm to search efficiently for promising utility functions in multi-objective reinforcement learning. On Mujoco benchmarks it outperforms various MORL baselines, which indicates that the search balances competing objectives effectively.

While the paper does not address all the potential limitations of the approach, the results show its promise for tackling complex problems with multiple, conflicting goals. Methods like UCBUS could help agents make good decisions when real-world tasks involve several competing objectives.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
