Multi-Objective Recommendation via Multivariate Policy Learning

2405.02141

Published 5/6/2024 by Olivier Jeunen, Jatin Mandav, Ivan Potapov, Nakul Agarwal, Sourabh Vaid, Wenzhe Shi, Aleksei Ustimenko

cs.IR cs.LG

Abstract

Real-world recommender systems often need to balance multiple objectives when deciding which recommendations to present to users. These include behavioural signals (e.g. clicks, shares, dwell time), as well as broader objectives (e.g. diversity, fairness). Scalarisation methods are commonly used to handle this balancing task, where a weighted average of per-objective reward signals determines the final score used for ranking. Naturally, how these weights are computed exactly, is key to success for any online platform. We frame this as a decision-making task, where the scalarisation weights are actions taken to maximise an overall North Star reward (e.g. long-term user retention or growth). We extend existing policy learning methods to the continuous multivariate action domain, proposing to maximise a pessimistic lower bound on the North Star reward that the learnt policy will yield. Typical lower bounds based on normal approximations suffer from insufficient coverage, and we propose an efficient and effective policy-dependent correction for this. We provide guidance to design stochastic data collection policies, as well as highly sensitive reward signals. Empirical observations from simulations, offline and online experiments highlight the efficacy of our deployed approach.

Overview

Real-world recommender systems must balance multiple objectives, including user behavior signals and broader goals like diversity and fairness.
Scalarization methods, which use a weighted average of per-objective reward signals to determine the final ranking score, are commonly used to handle this balancing task.
This paper frames the computation of scalarization weights as a decision-making task, where the goal is to maximize an overall "North Star" reward (e.g., long-term user retention or growth).
The authors propose an approach that extends existing policy learning methods to the continuous multivariate action domain, aiming to maximize a pessimistic lower bound on the North Star reward.

Plain English Explanation

Recommender systems, the algorithms that suggest products or content to users, often need to balance multiple competing goals. For example, they may need to consider user engagement signals like clicks and time spent, as well as broader objectives like providing diverse recommendations or ensuring fairness.

Commonly, recommender systems use a scalarization method to combine these different objectives into a single score that determines the final ranking of recommended items. Scalarization involves calculating a weighted average of the various reward signals, where the weights determine the relative importance of each objective.
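
As a minimal sketch of scalarization, the snippet below ranks a few candidate items by a weighted average of their per-objective signals. The item names, signal names, and numeric values are made up for illustration and do not come from the paper.

```python
from typing import Dict, List


def scalarised_score(signals: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the per-objective signals for a single item."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * signals[name] for name in weights) / total_weight


def rank_items(items: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Return item ids sorted from highest to lowest scalarised score."""
    return sorted(items, key=lambda item: scalarised_score(items[item], weights), reverse=True)


# Hypothetical per-objective predictions for three candidate items.
candidates = {
    "video_a": {"click": 0.30, "share": 0.02, "diversity": 0.10},
    "video_b": {"click": 0.20, "share": 0.10, "diversity": 0.60},
    "video_c": {"click": 0.25, "share": 0.05, "diversity": 0.30},
}

clicks_only = {"click": 1.0, "share": 0.0, "diversity": 0.0}
balanced = {"click": 1.0, "share": 1.0, "diversity": 1.0}

print(rank_items(candidates, clicks_only))  # ['video_a', 'video_c', 'video_b']
print(rank_items(candidates, balanced))     # ['video_b', 'video_c', 'video_a']
```

The same candidates are ordered differently under the two weight vectors, which is why the choice of weights matters so much in practice.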

This paper views the process of setting these scalarization weights as a decision-making problem. The goal is to find the set of weights that will maximize an overarching "North Star" metric, such as long-term user retention or platform growth. The authors propose an approach that builds on existing policy learning methods to efficiently optimize these weights in a way that robustly maximizes the North Star reward.

The key innovation is the use of a pessimistic lower bound on the North Star reward. Rather than maximizing a point estimate of how well a new set of weights will perform, the system maximizes a conservative estimate that accounts for statistical uncertainty in the collected data. This matters for real-world recommender systems, where an overconfident estimate can lead to deploying weights that hurt the very metric they were meant to improve.

Technical Explanation

The paper presents a framework for multi-objective policy learning in the context of recommender systems. The authors formulate the problem as a decision-making task, where the goal is to learn a policy over the scalarization weights. These weights combine the per-objective reward signals (e.g., clicks, shares, dwell time, diversity, fairness) into a single ranking score, and the policy is chosen to maximize a separate, overall "North Star" reward such as long-term user retention or growth.
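
One way to write this formulation down is sketched below. The notation is assumed for illustration and is not taken from the paper: K objectives, a weight vector a, a per-objective signal r_k(i) for item i, and a policy π_θ that is a distribution over weight vectors. Any dependence on user context is omitted for brevity, and the weights are assumed to be normalized so that the sum is a weighted average.

```latex
% Scalarized ranking score for item i under weights a (assumed notation).
s_a(i) = \sum_{k=1}^{K} a_k \, r_k(i)

% Value of a policy over weights: expected North Star reward R
% observed after ranking with weights a drawn from the policy.
V(\pi_\theta) = \mathbb{E}_{a \sim \pi_\theta}\!\left[ \, \mathbb{E}[R \mid a] \, \right],
\qquad
\theta^\star = \arg\max_{\theta} V(\pi_\theta)
```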

To solve this problem, the authors extend existing policy learning methods to the continuous, multivariate action domain. They propose maximizing a pessimistic lower bound on the North Star reward, which helps the system make robust decisions that are less vulnerable to uncertainty.
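
To make the idea of a pessimistic lower bound concrete, here is a minimal sketch using a generic importance-weighted estimate with a plain normal-approximation bound. It is not the authors' estimator and does not include their policy-dependent correction. The diagonal Gaussian policies, the toy reward function, and all parameter values are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def gaussian_log_density(action, mu, sigma):
    """Log-density of an independent (diagonal) Gaussian over a weight vector."""
    return sum(
        -0.5 * math.log(2 * math.pi * s * s) - (a - m) ** 2 / (2 * s * s)
        for a, m, s in zip(action, mu, sigma)
    )


def ips_lower_bound(logged, target_mu, target_sigma, alpha=0.05):
    """Importance-weighted value estimate and its one-sided normal lower bound.

    `logged` is a list of (action, logging_log_density, reward) tuples.
    """
    weighted = []
    for action, log_p0, reward in logged:
        log_ratio = gaussian_log_density(action, target_mu, target_sigma) - log_p0
        weighted.append(math.exp(log_ratio) * reward)
    estimate = mean(weighted)
    std_err = stdev(weighted) / math.sqrt(len(weighted))
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    return estimate, estimate - z * std_err


# Toy logged data from a hypothetical logging policy and reward function.
random.seed(0)
mu0, sigma0 = [0.5, 0.5], [0.2, 0.2]
logged = []
for _ in range(5000):
    a = [random.gauss(m, s) for m, s in zip(mu0, sigma0)]
    r = 1.0 - (a[0] - 0.7) ** 2 - (a[1] - 0.3) ** 2 + random.gauss(0, 0.1)
    logged.append((a, gaussian_log_density(a, mu0, sigma0), r))

estimate, lower = ips_lower_bound(logged, [0.6, 0.4], [0.15, 0.15])
print(f"estimate={estimate:.3f}  lower bound={lower:.3f}")
```

A learner built on this sketch would search over the target policy's parameters for the highest lower bound rather than the highest estimate, so that candidate policies with noisy importance weights are penalized.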

Typical lower bound approaches based on normal approximations often suffer from insufficient coverage, so the authors introduce an efficient and effective policy-dependent correction to address this issue. They also provide guidance on designing stochastic data collection policies and highly sensitive reward signals to support the learning process.
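
A stochastic data collection policy can be as simple as perturbing the current production weights and recording how likely each perturbation was. The sketch below assumes independent Gaussian noise around hypothetical production weights; the paper's actual design guidance may differ, and the function name and values are illustrative only.

```python
import math
import random


def sample_logged_weights(production_weights, sigma, rng):
    """Draw one perturbed weight vector and return it with its log-density."""
    action, log_density = [], 0.0
    for w in production_weights:
        a = rng.gauss(w, sigma)
        action.append(a)
        # Accumulate the log-density of each independent Gaussian coordinate.
        log_density += -0.5 * math.log(2 * math.pi * sigma ** 2) - (a - w) ** 2 / (2 * sigma ** 2)
    return action, log_density


rng = random.Random(0)
production_weights = [1.0, 0.5, 0.2]  # hypothetical current weights
log_record = [sample_logged_weights(production_weights, 0.05, rng) for _ in range(1000)]

first_action, first_log_density = log_record[0]
print(first_action, first_log_density)
```

Storing the log-density alongside each sampled weight vector is what later allows importance weighting of the observed rewards. A wider `sigma` explores more of the weight space but exposes users to weights further from the production setting.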

The paper presents empirical observations from simulations, offline experiments, and online deployments, demonstrating the effectiveness of the proposed approach. The results highlight the importance of balancing multiple objectives in real-world recommender systems and the potential benefits of the authors' decision-making framework.

Critical Analysis

The paper presents a well-designed and compelling approach to the challenge of multi-objective optimization in recommender systems. The use of a pessimistic lower bound to guide the policy learning process is a sensible and robust strategy, addressing the limitations of more traditional methods.

One potential limitation of the approach is the reliance on accurate modeling of the North Star reward function and its relationship to the individual objective signals. In practice, these relationships may be complex and difficult to capture precisely, which could impact the effectiveness of the optimization process.

Additionally, the paper does not delve into the specific tradeoffs or potential unintended consequences of optimizing for the North Star reward. There may be instances where maximizing this high-level metric could lead to suboptimal outcomes for certain user segments or aspects of the recommender system's performance.

Further research could explore ways to incorporate more nuanced multi-objective optimization techniques, such as Pareto-efficient frontiers, to better capture the inherent tradeoffs and ensure a more balanced approach.
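
For readers unfamiliar with Pareto frontiers, the following sketch shows the basic idea on made-up data: a candidate is kept only if no other candidate is at least as good on every objective and strictly better on one. The outcome values are hypothetical, and this is not part of the paper's method.

```python
def dominates(p, q):
    """True if p is at least as good as q everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def pareto_front(points):
    """Return the points that no other point dominates (all objectives maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (engagement, diversity) outcomes for five weight settings.
outcomes = [(0.30, 0.10), (0.20, 0.60), (0.25, 0.30), (0.18, 0.50), (0.30, 0.05)]
print(pareto_front(outcomes))  # [(0.3, 0.1), (0.2, 0.6), (0.25, 0.3)]
```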

Conclusion

This paper presents a novel framework for multi-objective policy learning in the context of real-world recommender systems. By framing the computation of scalarization weights as a decision-making task and proposing an approach that maximizes a pessimistic lower bound on the North Star reward, the authors offer a robust and effective solution to a critical challenge facing many online platforms.

The technical insights and empirical observations provided in the paper highlight the importance of balancing multiple objectives, such as user engagement, diversity, and fairness, when designing recommender systems. The authors' work lays the groundwork for further advancements in this area, which could have significant implications for improving the user experience and driving long-term platform growth.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Reduced-Rank Multi-objective Policy Learning and Optimization

Ezinne Nwankwo, Michael I. Jordan, Angela Zhou

Evaluating the causal impacts of possible interventions is crucial for informing decision-making, especially towards improving access to opportunity. However, if causal effects are heterogeneous and predictable from covariates, personalized treatment decisions can improve individual outcomes and contribute to both efficiency and equity. In practice, however, causal researchers do not have a single outcome in mind a priori and often collect multiple outcomes of interest that are noisy estimates of the true target of interest. For example, in government-assisted social benefit programs, policymakers collect many outcomes to understand the multidimensional nature of poverty. The ultimate goal is to learn an optimal treatment policy that in some sense maximizes multiple outcomes simultaneously. To address such issues, we present a data-driven dimensionality-reduction methodology for multiple outcomes in the context of optimal policy learning with multiple objectives. We learn a low-dimensional representation of the true outcome from the observed outcomes using reduced rank regression. We develop a suite of estimates that use the model to denoise observed outcomes, including commonly-used index weightings. These methods improve estimation error in policy evaluation and optimization, including on a case study of real-world cash transfer and social intervention data. Reducing the variance of noisy social outcomes can improve the performance of algorithmic allocations.

4/30/2024

cs.LG stat.ML

Scalarisation-based risk concepts for robust multi-objective optimisation

Ben Tu, Nikolas Kantas, Robert M. Lee, Behrang Shafei

Robust optimisation is a well-established framework for optimising functions in the presence of uncertainty. The inherent goal of this problem is to identify a collection of inputs whose outputs are both desirable for the decision maker, whilst also being robust to the underlying uncertainties in the problem. In this work, we study the multi-objective extension of this problem from a computational standpoint. We identify that the majority of all robust multi-objective algorithms rely on two key operations: robustification and scalarisation. Robustification refers to the strategy that is used to marginalise over the uncertainty in the problem. Whilst scalarisation refers to the procedure that is used to encode the relative importance of each objective. As these operations are not necessarily commutative, the order that they are performed in has an impact on the resulting solutions that are identified and the final decisions that are made. This work aims to give an exposition on the philosophical differences between these two operations and highlight when one should opt for one ordering over the other. As part of our analysis, we showcase how many existing risk concepts can be easily integrated into the specification and solution of a robust multi-objective optimisation problem. Besides this, we also demonstrate how one can principally define the notion of a robust Pareto front and a robust performance metric based on our robustify and scalarise methodology. To illustrate the efficacy of these new ideas, we present two insightful numerical case studies which are based on real-world data sets.

5/17/2024

cs.LG stat.ML

Multi-objective optimisation via the R2 utilities

Ben Tu, Nikolas Kantas, Robert M. Lee, Behrang Shafei

The goal of multi-objective optimisation is to identify a collection of points which describe the best possible trade-offs between the multiple objectives. In order to solve this vector-valued optimisation problem, practitioners often appeal to the use of scalarisation functions in order to transform the multi-objective problem into a collection of single-objective problems. This set of scalarised problems can then be solved using traditional single-objective optimisation techniques. In this work, we formalise this convention into a general mathematical framework. We show how this strategy effectively recasts the original multi-objective optimisation problem into a single-objective optimisation problem defined over sets. An appropriate class of objective functions for this new problem are the R2 utilities, which are utility functions that are defined as a weighted integral over the scalarised optimisation problems. As part of our work, we show that these utilities are monotone and submodular set functions which can be optimised effectively using greedy optimisation algorithms. We then analyse the performance of these greedy algorithms both theoretically and empirically. Our analysis largely focusses on Bayesian optimisation, which is a popular probabilistic framework for black-box optimisation.

5/2/2024

cs.LG stat.ML

Robust Reinforcement Learning Objectives for Sequential Recommender Systems

Melissa Mozifian, Tristan Sylvain, Dave Evans, Lili Meng

Attention-based sequential recommendation methods have shown promise in accurately capturing users' evolving interests from their past interactions. Recent research has also explored the integration of reinforcement learning (RL) into these models, in addition to generating superior user representations. By framing sequential recommendation as an RL problem with reward signals, we can develop recommender systems that incorporate direct user feedback in the form of rewards, enhancing personalization for users. Nonetheless, employing RL algorithms presents challenges, including off-policy training, expansive combinatorial action spaces, and the scarcity of datasets with sufficient reward signals. Contemporary approaches have attempted to combine RL and sequential modeling, incorporating contrastive-based objectives and negative sampling strategies for training the RL component. In this work, we further emphasize the efficacy of contrastive-based objectives paired with augmentation to address datasets with extended horizons. Additionally, we recognize the potential instability issues that may arise during the application of negative sampling. These challenges primarily stem from the data imbalance prevalent in real-world datasets, which is a common issue in offline RL contexts. Furthermore, we introduce an enhanced methodology aimed at providing a more effective solution to these challenges. Experimental results across several real datasets show our method with increased robustness and state-of-the-art performance.

4/19/2024

cs.LG cs.AI cs.IR