Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches

Read original: arXiv:2312.12871 - Published 4/19/2024 by Yu Liu, Runzhe Wan, James McQueen, Doug Hains, Jinxiang Gu, Rui Song

Overview

This paper focuses on estimating the effect size for duration recommendation in online experiments.
It leverages hierarchical models and objective utility approaches to improve the accuracy of effect size estimation.
The goal is to provide more reliable recommendations for the optimal duration of online experiments.

Plain English Explanation

Online experiments, such as A/B tests, are commonly used to evaluate the effectiveness of changes to websites or apps. One important design choice in these experiments is the duration, or the length of time the experiment runs. Choosing the right duration is crucial, as it affects the reliability and statistical power of the results.

The authors of this paper propose a new approach for estimating the effect size, which is a measure of the magnitude of the difference between the experimental conditions. By using hierarchical models and objective utility approaches, they aim to provide more accurate recommendations for the optimal duration of online experiments.

Hierarchical models take into account the nested structure of the data, where individual users are nested within different experimental conditions. This allows the model to better capture the variability within and between these conditions. Objective utility approaches, on the other hand, focus on maximizing the utility of the experiment, which could be measured in terms of statistical power, revenue, or other relevant metrics.

By combining these two techniques, the authors hope to give researchers and practitioners more reliable guidance on how long to run their online experiments, ultimately leading to better-informed decisions and more robust findings.

Technical Explanation

The paper presents a novel approach for estimating the effect size in online experiments, which is a crucial step in determining the optimal duration of the experiment. The authors leverage hierarchical models and objective utility approaches to improve the accuracy of the effect size estimation.
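To make the link between effect size and duration concrete, here is a minimal sketch using the textbook power calculation for a two-sided, two-sample z-test. It is not taken from the paper: the function name `required_days`, the equal-variance assumption, and the constant daily traffic per arm are all illustrative assumptions.

```python
import math
from statistics import NormalDist


def required_days(aes, sigma, daily_users_per_arm, alpha=0.05, power=0.8):
    """Days needed to detect an assumed effect size `aes`.

    Assumes a two-sided, two-sample z-test with a common standard
    deviation `sigma` and a constant number of new users per arm per day.
    All parameter names are hypothetical, chosen for this sketch.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    # Standard sample-size formula: n per arm = 2 * (sigma * (z_a + z_b) / aes)^2
    n_per_arm = 2 * (sigma * (z_alpha + z_beta) / aes) ** 2
    return math.ceil(n_per_arm / daily_users_per_arm)


if __name__ == "__main__":
    for aes in (0.2, 0.1, 0.05):
        days = required_days(aes, sigma=1.0, daily_users_per_arm=100)
        print(f"assumed effect size {aes}: about {days} days")
```

Because the required sample size scales with the inverse square of the assumed effect size, halving the assumed effect roughly quadruples the duration, which is why the choice of effect size matters so much.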

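Hierarchical models are used to account for the nested structure of the data, where individual users are nested within different experimental conditions. This allows the model to better capture the variability within and between these conditions, leading to more precise estimates of the effect size. According to the paper's abstract, the first proposed method is specifically a three-layer Gaussian Mixture Model that accounts for heteroskedasticity across experiments and seeks to estimate the true expected effect size among positive experiments.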
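The general idea of borrowing strength across noisy estimates can be illustrated with a much simpler stand-in than the paper's model: a two-level normal-normal model fitted by method of moments. The sketch below is not the authors' three-layer Gaussian Mixture Model. The function name `shrink_effects` and the input format (one effect estimate and one standard error per historical experiment) are assumptions made for illustration.

```python
def shrink_effects(estimates, std_errors):
    """Shrink noisy per-experiment effect estimates toward a common mean.

    Simplified two-level model (an assumption of this sketch):
        true_effect_i ~ Normal(mu, tau^2)
        estimate_i    ~ Normal(true_effect_i, std_error_i^2)
    Returns (mu, tau2, shrunk_estimates).
    """
    n = len(estimates)
    # Precision-weighted grand mean, so noisier experiments count less.
    weights = [1.0 / se ** 2 for se in std_errors]
    mu = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    # Method of moments: observed spread minus average sampling variance.
    observed_var = sum((e - mu) ** 2 for e in estimates) / (n - 1)
    mean_sampling_var = sum(se ** 2 for se in std_errors) / n
    tau2 = max(observed_var - mean_sampling_var, 0.0)
    shrunk = []
    for e, se in zip(estimates, std_errors):
        # Weight on the experiment's own estimate; small when se is large.
        reliability = tau2 / (tau2 + se ** 2)
        shrunk.append(mu + reliability * (e - mu))
    return mu, tau2, shrunk


if __name__ == "__main__":
    # Made-up numbers purely for demonstration.
    mu, tau2, shrunk = shrink_effects([0.5, 0.1, -0.2, 0.3], [0.4, 0.1, 0.1, 0.2])
    print(mu, tau2, shrunk)
```

Allowing a different standard error per experiment is one simple way to respect heteroskedasticity: imprecise experiments are pulled more strongly toward the shared mean than precise ones.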

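The objective utility approach focuses on maximizing the utility of the experiment, which could be measured in terms of statistical power, revenue, or other relevant metrics. By incorporating this objective function into the effect size estimation, the authors aim to provide recommendations that are aligned with the overall goals of the experiment. According to the paper's abstract, this second method is grounded in utility theory and determines the effect size by balancing the experiment's cost against the precision of decision-making.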
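A toy version of a utility-based choice is sketched below. It is not the paper's utility function: the linear per-user cost, the gain proportional to the detected effect, the equally weighted list of plausible true effects, and every function and parameter name are assumptions made for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def design_n(aes, sigma, alpha=0.05, target_power=0.8):
    """Users per arm for a one-sided two-sample z-test designed around `aes`."""
    z_sum = STD_NORMAL.inv_cdf(1 - alpha) + STD_NORMAL.inv_cdf(target_power)
    return 2 * (sigma * z_sum / aes) ** 2


def power(true_effect, n_per_arm, sigma, alpha=0.05):
    """Probability that the test detects `true_effect` with `n_per_arm` users."""
    se = sigma * math.sqrt(2 / n_per_arm)
    return 1 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1 - alpha) - true_effect / se)


def expected_utility(aes, prior_effects, sigma, cost_per_user, value_per_unit_effect):
    """Expected gain from detected effects minus the cost of the sample."""
    n = design_n(aes, sigma)
    # Average over equally weighted plausible true effects (a crude prior).
    gain = sum(
        value_per_unit_effect * d * power(d, n, sigma) for d in prior_effects
    ) / len(prior_effects)
    return gain - cost_per_user * 2 * n


def best_aes(candidates, prior_effects, sigma, cost_per_user, value_per_unit_effect):
    """Candidate assumed effect size with the highest expected utility."""
    return max(
        candidates,
        key=lambda a: expected_utility(
            a, prior_effects, sigma, cost_per_user, value_per_unit_effect
        ),
    )


if __name__ == "__main__":
    grid = [0.05, 0.1, 0.2]
    print(best_aes(grid, prior_effects=grid, sigma=1.0,
                   cost_per_user=1e-5, value_per_unit_effect=1.0))
```

The trade-off is visible at the extremes: when users are free, the smallest candidate wins because a longer experiment only adds power, and when users are very expensive, the largest candidate wins because it needs the shortest experiment.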

The authors evaluate their approach on both simulated and real data. They report that their methods outperform baseline methods in terms of effect size estimation and the resulting duration recommendations.

Critical Analysis

The authors acknowledge several limitations of their work, including the assumptions made about the underlying data-generating process and the reliance on the accuracy of the objective utility function. Additionally, the paper does not address the potential impact of interference or network effects on the experimental results, which could be an important consideration in some online settings.

Further research could explore the robustness of the proposed approach to violations of these assumptions, as well as investigate the performance of the method under different types of experimental designs and objectives. Additionally, incorporating ideas from related work, such as studies of how framing uncertainty impacts longitudinal inference or doubly robust inference in causal latent factor models, could potentially enhance the method's ability to handle more complex scenarios.

Conclusion

This paper presents a novel approach for estimating the effect size in online experiments, leveraging hierarchical models and objective utility approaches. The authors demonstrate the effectiveness of their method in improving the accuracy of duration recommendations, which can have significant implications for the design and analysis of online experiments.

By providing more reliable guidance on experiment duration, the proposed technique can help researchers and practitioners make better-informed decisions, ultimately leading to more robust findings and more effective interventions. The authors' work contributes to ongoing research on online experimentation, including related efforts such as A/B testing under interference with partial network information, further advancing the field of online experimentation and decision-making.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches

Yu Liu, Runzhe Wan, James McQueen, Doug Hains, Jinxiang Gu, Rui Song

The selection of the assumed effect size (AES) critically determines the duration of an experiment, and hence its accuracy and efficiency. Traditionally, experimenters determine AES based on domain knowledge. However, this method becomes impractical for online experimentation services managing numerous experiments, and a more automated approach is hence of great demand. We initiate the study of data-driven AES selection for online experimentation services by introducing two solutions. The first employs a three-layer Gaussian Mixture Model considering the heteroskedasticity across experiments, and it seeks to estimate the true expected effect size among positive experiments. The second method, grounded in utility theory, aims to determine the optimal effect size by striking a balance between the experiment's cost and the precision of decision-making. Through comparisons with baseline methods using both simulated and real data, we showcase the superior performance of the proposed approaches.

4/19/2024

Setting the duration of online A/B experiments

Harrison H. Li, Chaoyu Yu

In designing an online A/B experiment, it is crucial to select a sample size and duration that ensure the resulting confidence interval (CI) for the treatment effect is the right width to detect an effect of meaningful magnitude with sufficient statistical power without wasting resources. While the relationship between sample size and CI width is well understood, the effect of experiment duration on CI width remains less clear. This paper provides an analytical formula for the width of a CI based on a ratio treatment effect estimator as a function of both sample size (N) and duration (T). The formula is derived from a mixed effects model with two variance components. One component, referred to as the temporal variance, persists over time for experiments where the same users are kept in the same experiment arm across different days. The remaining error variance component, by contrast, decays to zero as T gets large. The formula we derive introduces a key parameter that we call the user-specific temporal correlation (UTC), which quantifies the relative sizes of the two variance components and can be estimated from historical experiments. Higher UTC indicates a slower decay in CI width over time. On the other hand, when the UTC is 0 -- as for experiments where users shuffle in and out of the experiment across days -- the CI width decays at the standard parametric 1/T rate. We also study how access to pre-period data for the users in the experiment affects the CI width decay. We show our formula closely explains CI widths on real A/B experiments at YouTube.

8/7/2024

Uplift Modeling Under Limited Supervision

George Panagopoulos, Daniele Malitesta, Fragkiskos D. Malliaros, Jun Pang

Estimating causal effects in e-commerce tends to involve costly treatment assignments which can be impractical in large-scale settings. Leveraging machine learning to predict such treatment effects without actual intervention is a standard practice to diminish the risk. However, existing methods for treatment effect prediction tend to rely on training sets of substantial size, which are built from real experiments and are thus inherently risky to create. In this work we propose a graph neural network to diminish the required training set size, relying on graphs that are common in e-commerce data. Specifically, we view the problem as node regression with a restricted number of labeled instances, develop a two-model neural architecture akin to previous causal effect estimators, and test varying message-passing layers for encoding. Furthermore, as an extra step, we combine the model with an acquisition function to guide the creation of the training set in settings with extremely low experimental budget. The framework is flexible since each step can be used separately with other models or treatment policies. The experiments on real large-scale networks indicate a clear advantage of our methodology over the state of the art, which in many cases performs close to random, underlining the need for models that can generalize with limited supervision to reduce experimental risks.

9/4/2024

Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation

Divyat Mahajan, Ioannis Mitliagkas, Brady Neal, Vasilis Syrgkanis

We study the problem of model selection in causal inference, specifically for conditional average treatment effect (CATE) estimation. Unlike machine learning, there is no perfect analogue of cross-validation for model selection as we do not observe the counterfactual potential outcomes. Towards this, a variety of surrogate metrics have been proposed for CATE model selection that use only observed data. However, we do not have a good understanding regarding their effectiveness due to limited comparisons in prior studies. We conduct an extensive empirical analysis to benchmark the surrogate model selection metrics introduced in the literature, as well as the novel ones introduced in this work. We ensure a fair comparison by tuning the hyperparameters associated with these metrics via AutoML, and provide more detailed trends by incorporating realistic datasets via generative modeling. Our analysis suggests novel model selection strategies based on careful hyperparameter selection of CATE estimators and causal ensembling.

4/30/2024