Setting the duration of online A/B experiments

Read original: arXiv:2408.02830 - Published 8/7/2024 by Harrison H. Li, Chaoyu Yu

Overview

The paper discusses how to determine the appropriate duration for online A/B experiments to ensure reliable results.
It analyzes the relationship between confidence interval (CI) width and experiment duration, providing a framework for setting experiment duration.
The authors propose a method to estimate the minimum experiment duration required to achieve a target CI width.

Plain English Explanation

When companies run online experiments (A/B tests) to compare the performance of different product features, it's important to determine how long the experiment should run. Run the experiment for too short a time, and the results may not be statistically reliable. Run it for too long, and you're wasting time and resources.

This paper offers a way to find the "sweet spot" - the minimum experiment duration needed to get accurate, trustworthy results. The key is understanding how the confidence interval (a statistical measure of result reliability) changes as the experiment goes on.

The authors show that the confidence interval gets narrower over time as more data is collected. They provide a framework to estimate how long you need to run the experiment to achieve a target confidence interval width. This helps ensure the results are precise enough to make informed decisions, without running the test longer than necessary.

By using this approach, companies can set the right experiment duration upfront, saving time and resources compared to running experiments for an arbitrary length of time.

Technical Explanation

The paper explores the relationship between the confidence interval (CI) width of an A/B experiment's results and the experiment's duration. The CI width is a measure of the precision or reliability of the experiment's findings.

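The authors derive a formula showing that, in the simplest case, the CI width shrinks in inverse proportion to the square root of the experiment duration. The duration required to reach a target CI width therefore grows quadratically: to halve the CI width, you need to run the experiment four times as long. The paper's abstract adds a caveat: this standard parametric decay applies when the user-specific temporal correlation (UTC) is 0, as when users shuffle in and out of the experiment across days. A higher UTC means the CI width decays more slowly over time.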
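The scaling can be illustrated with a small numerical sketch. This is a toy compound-symmetry model, not the paper's formula: the function name `ci_width`, the parameters `sigma` and `rho`, and the example numbers are all assumptions made for illustration. Here `rho` plays a role loosely analogous to the UTC, in that it sets how much of the per-user variance persists across days.

```python
import math

def ci_width(sigma, n_users, t_days, rho=0.0, z=1.96):
    """Full width of a two-sided CI for a difference in means between two arms.

    Toy model (an assumption, not the paper's estimator): each user contributes
    one observation per day with variance sigma**2, and observations from the
    same user on different days share correlation rho.
    """
    # Variance of one user's T-day average under compound symmetry:
    # a part that persists (rho) plus a part that averages out ((1 - rho) / T).
    per_user_var = sigma**2 * (rho + (1.0 - rho) / t_days)
    # Two arms of n_users each; the variances of the two arm means add.
    se_diff = math.sqrt(2.0 * per_user_var / n_users)
    return 2.0 * z * se_diff

# With rho = 0, quadrupling the duration halves the CI width.
ratio_independent = ci_width(1.0, 10_000, 28) / ci_width(1.0, 10_000, 7)

# With rho = 0.5, the same fourfold increase shrinks the width by only about 5%.
ratio_correlated = (ci_width(1.0, 10_000, 28, rho=0.5)
                    / ci_width(1.0, 10_000, 7, rho=0.5))

print(ratio_independent, ratio_correlated)
```

In this toy model the width cannot fall below `2 * z * sqrt(2 * rho * sigma**2 / n_users)` however long the experiment runs, which matches the abstract's qualitative point that one variance component persists over time while the other decays.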

Building on this insight, the paper presents a method to estimate the minimum experiment duration required to reach a desired CI width. This involves calculating the necessary sample size based on factors like the expected effect size and desired statistical power. The authors demonstrate how this approach can be applied in practice.
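As a rough illustration of the sample-size step described above, the following sketch uses the textbook two-sample power calculation for a difference in means and converts the result into days. It is not the paper's method: it assumes independent users, each counted once, arriving at a constant daily rate, and it ignores temporal correlation entirely. The function names and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def users_per_arm(effect_size, sigma, alpha=0.05, power=0.8):
    """Users needed in each arm to detect `effect_size` with a two-sided z-test.

    Assumes equal arm sizes and a common per-user standard deviation `sigma`.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical value
    z_power = NormalDist().inv_cdf(power)              # quantile for the power
    return math.ceil(2.0 * ((z_alpha + z_power) * sigma / effect_size) ** 2)

def min_duration_days(effect_size, sigma, new_users_per_arm_per_day,
                      alpha=0.05, power=0.8):
    """Days needed to accumulate the required users at a constant daily rate."""
    n = users_per_arm(effect_size, sigma, alpha, power)
    return math.ceil(n / new_users_per_arm_per_day)

# Hypothetical inputs: detect a 0.05 shift in a metric with standard deviation 1.0,
# with 1,000 new users entering each arm per day.
print(users_per_arm(0.05, 1.0))            # 6280
print(min_duration_days(0.05, 1.0, 1000))  # 7
```

Because the required sample size scales with the inverse square of the effect size, halving the effect you want to detect roughly quadruples the duration under these assumptions.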

Critical Analysis

The paper provides a rigorous, well-grounded framework for setting the duration of online A/B experiments. The authors acknowledge some limitations, such as the assumption of a normal distribution for the metric of interest.

One potential issue not addressed is the impact of platform changes, seasonality, or other external factors that could affect the stability of the experiment environment over time. These factors may introduce biases that the proposed duration-setting method does not account for.

Additionally, the paper focuses on setting the overall experiment duration, but does not delve into techniques for dynamically adjusting duration based on interim results or other considerations. Integrating such dynamic approaches could further improve the efficiency of online experiments.

Conclusion

This paper presents a principled framework for determining the appropriate duration of online A/B experiments. By understanding the relationship between confidence intervals and experiment duration, companies can set the right experiment length upfront to ensure reliable results without wasting time or resources.

The insights from this research can help optimize the experimentation process, enabling companies to make better-informed decisions about product changes and improvements. As online experimentation continues to grow in importance, tools like this will become increasingly valuable for data-driven organizations.

Related Papers

Setting the duration of online A/B experiments

Harrison H. Li, Chaoyu Yu

In designing an online A/B experiment, it is crucial to select a sample size and duration that ensure the resulting confidence interval (CI) for the treatment effect is the right width to detect an effect of meaningful magnitude with sufficient statistical power without wasting resources. While the relationship between sample size and CI width is well understood, the effect of experiment duration on CI width remains less clear. This paper provides an analytical formula for the width of a CI based on a ratio treatment effect estimator as a function of both sample size (N) and duration (T). The formula is derived from a mixed effects model with two variance components. One component, referred to as the temporal variance, persists over time for experiments where the same users are kept in the same experiment arm across different days. The remaining error variance component, by contrast, decays to zero as T gets large. The formula we derive introduces a key parameter that we call the user-specific temporal correlation (UTC), which quantifies the relative sizes of the two variance components and can be estimated from historical experiments. Higher UTC indicates a slower decay in CI width over time. On the other hand, when the UTC is 0 -- as for experiments where users shuffle in and out of the experiment across days -- the CI width decays at the standard parametric 1/T rate. We also study how access to pre-period data for the users in the experiment affects the CI width decay. We show our formula closely explains CI widths on real A/B experiments at YouTube.

8/7/2024

Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches

Yu Liu, Runzhe Wan, James McQueen, Doug Hains, Jinxiang Gu, Rui Song

The selection of the assumed effect size (AES) critically determines the duration of an experiment, and hence its accuracy and efficiency. Traditionally, experimenters determine AES based on domain knowledge. However, this method becomes impractical for online experimentation services managing numerous experiments, and a more automated approach is hence in great demand. We initiate the study of data-driven AES selection for online experimentation services by introducing two solutions. The first employs a three-layer Gaussian Mixture Model considering the heteroskedasticity across experiments, and it seeks to estimate the true expected effect size among positive experiments. The second method, grounded in utility theory, aims to determine the optimal effect size by striking a balance between the experiment's cost and the precision of decision-making. Through comparisons with baseline methods using both simulated and real data, we showcase the superior performance of the proposed approaches.

4/19/2024

Data-Driven Switchback Experiments: Theoretical Tradeoffs and Empirical Bayes Designs

Ruoxuan Xiong, Alex Chin, Sean J. Taylor

We study the design and analysis of switchback experiments conducted on a single aggregate unit. The design problem is to partition the continuous time space into intervals and switch treatments between intervals, in order to minimize the estimation error of the treatment effect. We show that the estimation error depends on four factors: carryover effects, periodicity, serially correlated outcomes, and impacts from simultaneous experiments. We derive a rigorous bias-variance decomposition and show the tradeoffs of the estimation error from these factors. The decomposition provides three new insights in choosing a design: First, balancing the periodicity between treated and control intervals reduces the variance; second, switching less frequently reduces the bias from carryover effects while increasing the variance from correlated outcomes, and vice versa; third, randomizing interval start and end points reduces both bias and variance from simultaneous experiments. Combining these insights, we propose a new empirical Bayes design approach. This approach uses prior data and experiments for designing future experiments. We illustrate this approach using real data from a ride-sharing platform, yielding a design that reduces MSE by 33% compared to the status quo design used on the platform.

6/12/2024

Counteracting Duration Bias in Video Recommendation via Counterfactual Watch Time

Haiyuan Zhao, Guohao Cai, Jieming Zhu, Zhenhua Dong, Jun Xu, Ji-Rong Wen

In video recommendation, an ongoing effort is to satisfy users' personalized information needs by leveraging their logged watch time. However, watch time prediction suffers from duration bias, hindering its ability to reflect users' interests accurately. Existing label-correction approaches attempt to uncover user interests through grouping and normalizing observed watch time according to video duration. Although effective to some extent, we found that these approaches regard completely played records (i.e., a user watches the entire video) as equally high interest, which deviates from what we observed on real datasets: users have varied explicit feedback proportion when completely playing videos. In this paper, we introduce the counterfactual watch time (CWT), the potential watch time a user would spend on the video if its duration were sufficiently long. Analysis shows that the duration bias is caused by the truncation of CWT due to the video duration limitation, which usually occurs on those completely played records. Besides, a Counterfactual Watch Model (CWM) is proposed, revealing that CWT equals the time users get the maximum benefit from video recommender systems. Moreover, a cost-based transform function is defined to transform the CWT into the estimation of user interest, and the model can be learned by optimizing a counterfactual likelihood function defined over observed user watch times. Extensive experiments on three real video recommendation datasets and online A/B testing demonstrated that CWM effectively enhanced video recommendation accuracy and counteracted the duration bias.

6/14/2024