Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests

Read original: arXiv:2407.01036 - Published 7/2/2024 by Pallavi Basu, Ron Berman

Overview

This paper presents a cost-benefit approach to conducting large-scale A/B tests, where the goal is to identify the most profitable changes to implement while controlling the false discovery rate (FDR), rather than simply detecting statistically significant differences.
The authors rank experiments by the ratio of the expected lift to the cost of a wrong rejection, computed with the local false discovery rate (lfdr) statistic, allowing for more informed decision-making.
The proposed method is validated in experimental studies, compared with other FDR control methods, and applied to actual Optimizely experiments.

Plain English Explanation

When companies run A/B tests to compare different versions of a product or feature, the goal is often to find statistically significant differences that indicate which version performs better. However, significance alone says nothing about how much a change is worth or what a mistaken decision costs. The authors of this paper propose a different way of analyzing the results, called "ranking by lifts," which weighs the expected size of each improvement (its lift) against the cost of acting on a result that turns out to be wrong.

Imagine you are choosing between two tested changes to a website. Change A shows the larger lift in conversion rate, but acting on it is expensive if the result turns out to be a false positive. Change B shows a smaller but still significant lift and costs far less if the call is wrong. In this case, Change B may be the better choice overall, even though Change A had the larger measured effect.

The "ranking by lifts" approach allows you to weigh the benefits of a change against the costs, making it easier to prioritize the most impactful and cost-effective improvements. This can be especially useful for large-scale A/B testing, where you may have many potential changes to consider.

Technical Explanation

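The paper ranks candidate changes by a ratio: the expected lift of each change divided by the cost of wrongly rejecting its null hypothesis, computed with the local false discovery rate (lfdr) statistic. The selection is framed as a decision-theoretic problem of maximizing profits subject to false discovery rate (FDR) control, and the authors build an empirical Bayes solution via a greedy knapsack approach.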
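
A minimal Python sketch of this kind of ratio ranking follows. It is not the authors' procedure: the `Experiment` fields, the score formula, the stopping rule (keep the average lfdr of the selected experiments at or below `alpha`), and all numbers are illustrative assumptions, loosely following the abstract's description of ranking expected lift against the cost of wrong rejections under false discovery rate (FDR) control.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    lift: float  # hypothetical gain if the effect is real
    cost: float  # hypothetical loss if the rejection is wrong
    lfdr: float  # estimated probability that the null is true, given the data


def score(e: Experiment) -> float:
    """Expected gain divided by expected loss (an assumed scoring rule)."""
    expected_gain = (1.0 - e.lfdr) * e.lift
    expected_loss = e.lfdr * e.cost
    return float("inf") if expected_loss == 0 else expected_gain / expected_loss


def rank_and_select(experiments, alpha=0.1):
    """Greedily take experiments in score order while the running
    average lfdr (an estimate of the FDR) stays at or below alpha."""
    ranked = sorted(experiments, key=score, reverse=True)
    selected, lfdr_sum = [], 0.0
    for e in ranked:
        if (lfdr_sum + e.lfdr) / (len(selected) + 1) > alpha:
            break  # adding this experiment would exceed the FDR budget
        selected.append(e)
        lfdr_sum += e.lfdr
    return selected


# Made-up inputs, for illustration only.
experiments = [
    Experiment("new_checkout", lift=5.0, cost=1.0, lfdr=0.02),
    Experiment("banner_color", lift=1.0, cost=1.0, lfdr=0.40),
    Experiment("pricing_page", lift=8.0, cost=4.0, lfdr=0.10),
    Experiment("search_ranker", lift=3.0, cost=0.5, lfdr=0.05),
]

for e in rank_and_select(experiments, alpha=0.1):
    print(e.name, round(score(e), 1))
```

With these made-up inputs, `banner_color` is left out: adding it would push the average lfdr of the selected set above `alpha`. Stopping at the first violation mirrors a greedy knapsack fill; a production procedure would also need to estimate the lfdr values from data, which this sketch takes as given.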

The authors establish asymptotic validity for the data-driven procedure and demonstrate finite-sample validity in experimental studies. They also compare the "ranking by lifts" method with other FDR control methods and discuss an application to actual Optimizely experiments.

The paper also discusses some of the challenges and limitations of the proposed method, such as the need to accurately estimate implementation costs and the potential for overfitting when dealing with a large number of potential changes.

Critical Analysis

The "ranking by lifts" approach presented in this paper is a compelling idea that addresses an important issue in large-scale A/B testing. By considering both the effect size and the implementation cost, the method can help companies make more informed decisions about which changes to prioritize.

However, the approach has practical pitfalls that deserve attention. For example, accurately estimating costs can be challenging, and relying too heavily on these estimates could lead to suboptimal decisions. It is also unclear how the ranking should treat changes that combine a large lift with a high cost, or how to account for other factors that may influence the overall value of a change.

The authors also do not explore the potential biases that may arise when using this method, such as a tendency to favor low-cost changes over more impactful but more expensive ones. Further research may be needed to address these limitations and ensure that the "ranking by lifts" approach is robust and reliable in real-world applications.

Conclusion

The "ranking by lifts" method presented in this paper offers a novel approach to large-scale A/B testing, shifting the focus from simply detecting statistically significant differences to identifying the most beneficial changes to implement. By considering both the effect size and the implementation cost, the proposed method can help companies make more informed decisions and prioritize the changes that will have the greatest impact.

While the paper demonstrates the potential of this approach, further research and validation may be needed to address its limitations and ensure its widespread applicability. Nonetheless, the "ranking by lifts" concept represents an important step forward in the field of A/B testing and decision-making, with implications for various industries and applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests

Pallavi Basu, Ron Berman

A/B testers conducting large-scale tests prioritize lifts and want to be able to control false rejections of the null. This work develops a decision-theoretic framework for maximizing profits subject to false discovery rate (FDR) control. We build an empirical Bayes solution for the problem via the greedy knapsack approach. We derive an oracle rule based on ranking the ratio of expected lifts and the cost of wrong rejections using the local false discovery rate (lfdr) statistic. Our oracle decision rule is valid and optimal for large-scale tests. Further, we establish asymptotic validity for the data-driven procedure and demonstrate finite-sample validity in experimental studies. We also demonstrate the merit of the proposed method over other FDR control methods. Finally, we discuss an application to actual Optimizely experiments.

7/2/2024

Online Local False Discovery Rate Control: A Resource Allocation Approach

Ruicheng Ao, Hongyu Chen, David Simchi-Levi, Feng Zhu

We consider the problem of sequentially conducting multiple experiments where each experiment corresponds to a hypothesis testing task. At each time point, the experimenter must make an irrevocable decision of whether to reject the null hypothesis (or equivalently claim a discovery) before the next experimental result arrives. The goal is to maximize the number of discoveries while maintaining a low error rate at all time points measured by Local False Discovery Rate (LFDR). We formulate the problem as an online knapsack problem with exogenous random budget replenishment. We start with general arrival distributions and show that a simple policy achieves a $O(\sqrt{T})$ regret. We complement the result by showing that such regret rate is in general not improvable. We then shift our focus to discrete arrival distributions. We find that many existing re-solving heuristics in the online resource allocation literature, albeit achieve bounded loss in canonical settings, may incur a $\Omega(\sqrt{T})$ or even a $\Omega(T)$ regret. With the observation that canonical policies tend to be too optimistic and over claim discoveries, we propose a novel policy that incorporates budget safety buffers. It turns out that a little more safety can greatly enhance efficiency -- small additional logarithmic buffers suffice to reduce the regret from $\Omega(\sqrt{T})$ or even $\Omega(T)$ to $O(\ln^2 T)$. From a practical perspective, we extend the policy to the scenario with continuous arrival distributions, time-dependent information structures, as well as unknown $T$. We conduct both synthetic experiments and empirical applications on a time series data from New York City taxi passengers to validate the performance of our proposed policies. Our results emphasize how effective policies should be designed in online resource allocation problems with exogenous budget replenishment.

7/17/2024

Learning Metrics that Maximise Power for Accelerated A/B-Tests

Olivier Jeunen, Aleksei Ustimenko

Online controlled experiments are a crucial tool to allow for confident decision-making in technology companies. A North Star metric is defined (such as long-term revenue or user retention), and system variants that statistically significantly improve on this metric in an A/B-test can be considered superior. North Star metrics are typically delayed and insensitive. As a result, the cost of experimentation is high: experiments need to run for a long time, and even then, type-II errors (i.e. false negatives) are prevalent. We propose to tackle this by learning metrics from short-term signals that directly maximise the statistical power they harness with respect to the North Star. We show that existing approaches are prone to overfitting, in that higher average metric sensitivity does not imply improved type-II errors, and propose to instead minimise the $p$-values a metric would have produced on a log of past experiments. We collect such datasets from two social media applications with over 160 million Monthly Active Users each, totalling over 153 A/B-pairs. Empirical results show that we are able to increase statistical power by up to 78% when using our learnt metrics stand-alone, and by up to 210% when used in tandem with the North Star. Alternatively, we can obtain constant statistical power at a sample size that is down to 12% of what the North Star requires, significantly reducing the cost of experimentation.

6/14/2024

Powerful A/B-Testing Metrics and Where to Find Them

Olivier Jeunen, Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko

Online controlled experiments, colloquially known as A/B-tests, are the bread and butter of real-world recommender system evaluation. Typically, end-users are randomly assigned some system variant, and a plethora of metrics are then tracked, collected, and aggregated throughout the experiment. A North Star metric (e.g. long-term growth or revenue) is used to assess which system variant should be deemed superior. As a result, most collected metrics are supporting in nature, and serve to either (i) provide an understanding of how the experiment impacts user experience, or (ii) allow for confident decision-making when the North Star metric moves insignificantly (i.e. a false negative or type-II error). The latter is not straightforward: suppose a treatment variant leads to fewer but longer sessions, with more views but fewer engagements; should this be considered a positive or negative outcome? The question then becomes: how do we assess a supporting metric's utility when it comes to decision-making using A/B-testing? Online platforms typically run dozens of experiments at any given time. This provides a wealth of information about interventions and treatment effects that can be used to evaluate metrics' utility for online evaluation. We propose to collect this information and leverage it to quantify type-I, type-II, and type-III errors for the metrics of interest, alongside a distribution of measurements of their statistical power (e.g. $z$-scores and $p$-values). We present results and insights from building this pipeline at scale for two large-scale short-video platforms: ShareChat and Moj; leveraging hundreds of past experiments to find online metrics with high statistical power.

7/31/2024