Hyperparameter Optimization Can Even be Harmful in Off-Policy Learning and How to Deal with It

Read original: arXiv:2404.15084 - Published 4/24/2024 by Yuta Saito, Masahiro Nomura

Overview

The paper explores the challenge of optimizing hyperparameters for off-policy learning, where the goal is to find the best decision-making policy based on biased, historical data.
The authors show that naively applying unbiased estimators of generalization performance as a surrogate objective can lead to unexpected failures, where the optimization process pursues hyperparameters with greatly overestimated performance.
The paper proposes simple and computationally efficient corrections to the typical hyperparameter optimization (HPO) procedure to address these issues.

Plain English Explanation

In many real-world applications, such as recommender systems and personalized medicine, decision-making policies are learned from historical data that may be biased or incomplete. Researchers have made progress in developing estimators that can accurately evaluate the effectiveness of these "counterfactual" policies based on the biased data.

However, these estimators are often used not just for evaluation, but also to optimize the hyperparameters of the decision-making models. Hyperparameters are settings that are not learned from data, but instead chosen by the researcher to control the model's behavior. The process of finding the best hyperparameters is called hyperparameter optimization (HPO).

The authors of this paper show that naively using an unbiased estimator of generalization performance as the objective for HPO can lead to unexpected failures. Because the search keeps whichever candidate scores highest, it tends to favor hyperparameters whose estimate happens to be inflated by noise, so the reported performance of the selected configuration can greatly overstate how well it actually performs.

To address this issue, the paper proposes simple and efficient corrections to the typical HPO procedure. These corrections help the optimization process avoid being misled by the biased data and instead find hyperparameters that truly perform well.

Technical Explanation

The paper investigates the task of hyperparameter optimization (HPO) for off-policy learning, where the goal is to find the optimal hyperparameters for a decision-making policy based on biased, historical data.
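As background, the kind of unbiased estimator discussed here can be illustrated with inverse propensity scoring (IPS), a standard off-policy estimator. The summary does not say which estimator the authors use, so IPS is an assumption, and the toy data, `ips_estimate`, and `always_one` below are hypothetical names introduced only for illustration.

```python
import random


def ips_estimate(logs, target_policy):
    """Inverse propensity scoring estimate of a target policy's value.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy gave to `action`.
    target_policy: function (context, action) -> probability under the new policy.
    """
    weighted = [target_policy(x, a) / p * r for x, a, r, p in logs]
    return sum(weighted) / len(weighted)


# Toy setup (hypothetical): two actions, no context, action 1 is better.
rng = random.Random(0)
true_reward = {0: 0.2, 1: 0.6}
logging_probs = {0: 0.8, 1: 0.2}  # the logging policy mostly picks the worse action

logs = []
for _ in range(20000):
    a = 0 if rng.random() < logging_probs[0] else 1
    r = 1.0 if rng.random() < true_reward[a] else 0.0
    logs.append((None, a, r, logging_probs[a]))


def always_one(x, a):
    # Target policy that always plays action 1.
    return 1.0 if a == 1 else 0.0


estimate = ips_estimate(logs, always_one)          # close to the true value 0.6
naive = sum(r for _, _, r, _ in logs) / len(logs)  # average logged reward, not the target's value
print(f"IPS estimate: {estimate:.3f}, naive logged average: {naive:.3f}")
```

The reweighting corrects for the logging policy's preference for action 0, which is why the IPS estimate lands near the target policy's true value while the raw logged average does not.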

The authors first show empirically that naively applying an unbiased estimator of generalization performance as the surrogate objective in HPO can cause unexpected failures. Although each individual estimate is unbiased, the search selects the candidate with the highest estimate, so the chosen hyperparameters tend to be ones whose generalization performance is greatly overestimated. This effect is referred to here as "overestimation bias."
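The failure described above can be reproduced in a toy simulation. The sketch below is not the paper's experiment: it assumes 50 hypothetical candidates that all share the same true performance and an estimator whose noise is Gaussian, then compares the estimate of the selected winner against a fresh, independent estimate of that same winner.

```python
import random
import statistics

rng = random.Random(0)

TRUE_VALUE = 0.5     # every candidate has the same true performance (assumption)
NOISE_SD = 0.1       # spread of each unbiased estimate (assumption)
N_CANDIDATES = 50
N_TRIALS = 2000


def noisy_estimate():
    # Unbiased: the expectation of this estimate equals TRUE_VALUE.
    return rng.gauss(TRUE_VALUE, NOISE_SD)


selected_estimates = []
fresh_estimates = []
for _ in range(N_TRIALS):
    estimates = [noisy_estimate() for _ in range(N_CANDIDATES)]
    # Naive HPO: keep the candidate with the highest estimated value.
    selected_estimates.append(max(estimates))
    # Re-estimate the winner on independent data; all candidates are
    # identical here, so any fresh draw stands in for the winner.
    fresh_estimates.append(noisy_estimate())

mean_selected = statistics.mean(selected_estimates)
mean_fresh = statistics.mean(fresh_estimates)
print(f"estimate used for selection: {mean_selected:.3f}")
print(f"independent re-estimate:     {mean_fresh:.3f}")
```

Each individual estimate is unbiased, yet the value reported for the winner sits well above the true 0.5, while the independent re-estimate does not. The gap grows with the number of candidates and with the noise of the estimator.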

To address this issue, the paper proposes two simple and computationally efficient corrections to the typical HPO procedure:

Calibrated HPO: The authors suggest calibrating the unbiased estimator using a small amount of additional data to correct for the overestimation bias.
Uncertainty-Aware HPO: The authors propose incorporating uncertainty estimates of the unbiased estimator into the optimization process, allowing the HPO to avoid hyperparameters with highly uncertain performance estimates.
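
One way to picture the uncertainty-aware idea is to rank candidates by a lower confidence bound instead of the raw estimate. The sketch below is a generic pessimistic selection rule under a normal approximation, not the authors' algorithm; the function names, the `z` value, and the per-sample numbers are all hypothetical.

```python
import math
import statistics


def lower_confidence_bound(samples, z=1.645):
    """Sample mean minus z standard errors (normal-approximation lower bound)."""
    mean = statistics.mean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - z * stderr


def select_naive(candidates):
    # Pick the hyperparameter setting with the highest point estimate.
    return max(candidates, key=lambda name: statistics.mean(candidates[name]))


def select_pessimistic(candidates, z=1.645):
    # Pick the setting with the highest lower confidence bound.
    return max(candidates, key=lambda name: lower_confidence_bound(candidates[name], z))


# Hypothetical per-sample estimator terms for two hyperparameter settings.
# "noisy" has the higher mean only because of one heavily weighted sample.
candidates = {
    "stable": [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.51, 0.49],
    "noisy": [0.00, 0.00, 0.00, 4.40, 0.00, 0.00, 0.00, 0.00],
}

print("naive choice:      ", select_naive(candidates))
print("pessimistic choice:", select_pessimistic(candidates))
```

The naive rule picks the setting whose estimate rests on a single large weighted sample, while the pessimistic rule prefers the setting whose estimate is well supported. How strongly to penalize uncertainty (the `z` value here) is itself a design choice.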

The effectiveness of these proposed corrections is demonstrated through empirical investigations, which show that the typical HPO procedure can fail severely in situations where the proposed methods succeed.

Critical Analysis

The paper addresses an important and practical challenge in the field of off-policy learning, where researchers often need to optimize hyperparameters based on biased, historical data. The authors' insights into the "overestimation bias" problem and their proposed solutions are valuable contributions to the literature.

However, the paper does not explore the limitations of the proposed methods, such as the sensitivity to the quality and quantity of the additional data required for calibration, or the impact of the uncertainty estimates on the overall optimization efficiency.

Additionally, the paper does not discuss potential extensions or future research directions, such as exploring the interplay between the proposed HPO corrections and other techniques like continual learning or early discarding.

Readers may also be interested in related work on asynchronous multi-fidelity optimization or population-based training, which could provide additional insights into the challenges of hyperparameter optimization.

Conclusion

This paper presents an important contribution to the field of off-policy learning by highlighting the challenges of using unbiased estimators for hyperparameter optimization and proposing effective solutions to address these challenges.

The authors' insights into the "overestimation bias" problem and their proposed corrections to the typical HPO procedure have the potential to improve the reliability and efficiency of decision-making models in a wide range of applications, from recommender systems to personalized medicine.

While the paper does not explore all the limitations and future research directions, it lays the groundwork for further advancements in this area and encourages readers to think critically about the complexities involved in optimizing hyperparameters for off-policy learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Hyperparameter Optimization Can Even be Harmful in Off-Policy Learning and How to Deal with It

Yuta Saito, Masahiro Nomura

There has been a growing interest in off-policy evaluation in the literature such as recommender systems and personalized medicine. We have so far seen significant progress in developing estimators aimed at accurately estimating the effectiveness of counterfactual policies based on biased logged data. However, there are many cases where those estimators are used not only to evaluate the value of decision making policies but also to search for the best hyperparameters from a large candidate space. This work explores the latter hyperparameter optimization (HPO) task for off-policy learning. We empirically show that naively applying an unbiased estimator of the generalization performance as a surrogate objective in HPO can cause an unexpected failure, merely pursuing hyperparameters whose generalization performance is greatly overestimated. We then propose simple and computationally efficient corrections to the typical HPO procedure to deal with the aforementioned issues simultaneously. Empirical investigations demonstrate the effectiveness of our proposed HPO algorithm in situations where the typical procedure fails severely.

4/24/2024

Pessimistic Off-Policy Optimization for Learning to Rank

Matej Cief, Branislav Kveton, Michal Kompan

Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging due to the imbalance in logged data: some items are recommended and thus logged more frequently than others. This is further perpetuated when recommending a list of items, as the action space is combinatorial. To address this challenge, we study pessimistic off-policy optimization for learning to rank. The key idea is to compute lower confidence bounds on parameters of click models and then return the list with the highest pessimistic estimate of its value. This approach is computationally efficient, and we analyze it. We study its Bayesian and frequentist variants and overcome the limitation of unknown prior by incorporating empirical Bayes. To show the empirical effectiveness of our approach, we compare it to off-policy optimizers that use inverse propensity scores or neglect uncertainty. Our approach outperforms all baselines and is both robust and general.

8/26/2024

Hyperparameter Selection in Continual Learning

Thomas L. Lee, Sigrid Passano Hellan, Linus Ericsson, Elliot J. Crowley, Amos Storkey

In continual learning (CL) -- where a learner trains on a stream of data -- standard hyperparameter optimisation (HPO) cannot be applied, as a learner does not have access to all of the data at the same time. This has prompted the development of CL-specific HPO frameworks. The most popular way to tune hyperparameters in CL is to repeatedly train over the whole data stream with different hyperparameter settings. However, this end-of-training HPO is unrealistic as in practice a learner can only see the stream once. Hence, there is an open question: what HPO framework should a practitioner use for a CL problem in reality? This paper answers this question by evaluating several realistic HPO frameworks. We find that all the HPO frameworks considered, including end-of-training HPO, perform similarly. We therefore advocate using the realistic and most computationally efficient method: fitting the hyperparameters on the first task and then fixing them throughout training.

4/10/2024

A New Linear Scaling Rule for Private Adaptive Hyperparameter Optimization

Ashwinee Panda, Xinyu Tang, Saeed Mahloujifar, Vikash Sehwag, Prateek Mittal

An open problem in differentially private deep learning is hyperparameter optimization (HPO). DP-SGD introduces new hyperparameters and complicates existing ones, forcing researchers to painstakingly tune hyperparameters with hundreds of trials, which in turn makes it impossible to account for the privacy cost of HPO without destroying the utility. We propose an adaptive HPO method that uses cheap trials (in terms of privacy cost and runtime) to estimate optimal hyperparameters and scales them up. We obtain state-of-the-art performance on 22 benchmark tasks, across computer vision and natural language processing, across pretraining and finetuning, across architectures and a wide range of $\varepsilon \in [0.01, 8.0]$, all while accounting for the privacy cost of HPO.

5/7/2024