Powerful A/B-Testing Metrics and Where to Find Them

Read original: arXiv:2407.20665 - Published 7/31/2024 by Olivier Jeunen, Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko

Overview

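  • This paper asks how to judge whether a supporting A/B-testing metric is useful for decision-making when the North Star metric does not move significantly
  • It proposes using a platform's record of past experiments to quantify each metric's type-I, type-II, and type-III errors, alongside a distribution of $z$-scores and $p$-values
  • It reports results from building this pipeline at scale for two short-video platforms, ShareChat and Moj, using hundreds of past experiments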

Plain English Explanation

A/B testing is a powerful technique used by businesses to compare the performance of two or more versions of a product, website, or feature. By comparing the results, companies can identify the most effective approach and make data-driven decisions to improve their offerings.

This paper looks at how to tell which metrics are worth relying on in an A/B test. Platforms usually pick a single North Star metric, such as long-term growth or revenue, to decide which variant wins. When that metric does not move significantly, teams fall back on supporting metrics, and those can point in different directions: a variant might lead to fewer but longer sessions, with more views but fewer engagements.

The authors' answer is to learn from history. Online platforms typically run dozens of experiments at any given time, so they accumulate a large record of interventions and their measured effects. The paper proposes using that record to measure how often each candidate metric errs (type-I, type-II, and type-III errors) and how much statistical power it provides.

The authors built this pipeline for two large-scale short-video platforms, ShareChat and Moj, and used hundreds of past experiments to find online metrics with high statistical power.

Technical Explanation

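The paper starts from standard practice in recommender system evaluation: end-users are randomly assigned a system variant, many metrics are tracked and aggregated, and a North Star metric (e.g. long-term growth or revenue) decides which variant is superior. Most other metrics are supporting in nature: they either (i) explain how the experiment affects user experience, or (ii) support a confident decision when the North Star moves insignificantly, i.e. a false negative or type-II error.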

The central question is how to assess a supporting metric's utility for decision-making. The authors propose collecting the interventions and treatment effects from the many experiments a platform has already run, and using them to quantify type-I, type-II, and type-III errors for each metric of interest, alongside a distribution of measurements of its statistical power (e.g. $z$-scores and $p$-values).
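To make the notion of statistical power concrete: the paper's abstract names $z$-scores and $p$-values as the per-experiment measurements of interest. The following is a minimal sketch of how one such pair could be computed for a single metric in a single A/B-test, assuming a two-sample $z$-test on the difference in means with unequal variances. The function name `z_test` and the toy data are illustrative assumptions, not taken from the paper.

```python
import math
from statistics import mean, variance


def z_test(control, treatment):
    """Two-sample z-test on the difference in means (unequal variances).

    Returns the z-score and the two-sided p-value under a normal
    approximation. Hypothetical helper for illustration only.
    """
    diff = mean(treatment) - mean(control)
    # Standard error of the difference between two independent sample means.
    se = math.sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    z = diff / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Toy per-user metric values (made up for the example).
control = [1.0, 2.0, 3.0, 4.0, 5.0] * 20
treatment = [x + 0.5 for x in control]

z, p = z_test(control, treatment)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A metric that tends to produce larger absolute $z$-scores (smaller $p$-values) on experiments with a real effect is, in this sense, more powerful.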

This matters most when supporting metrics disagree. The abstract gives the example of a treatment variant that leads to fewer but longer sessions, with more views but fewer engagements. Without evidence about which metrics are reliable, it is unclear whether that outcome should count as positive or negative.

Finally, the authors present results and insights from building this pipeline at scale for two large-scale short-video platforms, ShareChat and Moj, leveraging hundreds of past experiments to find online metrics with high statistical power.
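The abstract describes quantifying type-I, type-II, and type-III errors for a metric from a log of past experiments. A rough sketch of that bookkeeping is shown below. It rests on several assumptions that are not confirmed by the summary: each past experiment is labelled with a known true direction (0 for an A/A test with no real effect), significance is judged at a two-sided 5% level, and a type-III error is taken to mean a significant result with the wrong sign. The `PastExperiment` record and `error_rates` function are hypothetical names.

```python
from dataclasses import dataclass

Z_CRIT = 1.96  # assumed two-sided 5% significance threshold


@dataclass
class PastExperiment:
    true_direction: int  # +1 or -1 for a known effect, 0 for A/A (no effect)
    z_score: float       # z-score the candidate metric produced


def error_rates(log):
    """Estimate type-I/II/III error rates of one metric over past experiments."""
    null = [e for e in log if e.true_direction == 0]
    real = [e for e in log if e.true_direction != 0]

    def rate(count, total):
        return count / total if total else float("nan")

    # Type-I: significant although there is no true effect.
    type1 = sum(abs(e.z_score) > Z_CRIT for e in null)
    # Type-II: not significant although there is a true effect.
    type2 = sum(abs(e.z_score) <= Z_CRIT for e in real)
    # Type-III (assumed definition): significant, but in the wrong direction.
    type3 = sum(
        abs(e.z_score) > Z_CRIT and e.z_score * e.true_direction < 0 for e in real
    )
    return {
        "type_I": rate(type1, len(null)),
        "type_II": rate(type2, len(real)),
        "type_III": rate(type3, len(real)),
    }


# Made-up log for one candidate metric.
log = [
    PastExperiment(0, 0.3), PastExperiment(0, -2.1),
    PastExperiment(0, 1.0), PastExperiment(0, 0.5),
    PastExperiment(+1, 2.5), PastExperiment(+1, 1.2),
    PastExperiment(-1, -3.0), PastExperiment(+1, -2.2),
    PastExperiment(-1, 0.4),
]
rates = error_rates(log)
print(rates)
```

Running the same computation for every candidate metric gives a like-for-like comparison of how trustworthy each one has been historically.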

Critical Analysis

The paper presents a comprehensive and well-researched exploration of A/B testing metrics and their applications. The authors have identified several important challenges and limitations in the current practices and have proposed innovative solutions to address them.

One potential concern is generalizability: the reported results come from two short-video platforms, ShareChat and Moj, and the approach relies on a large record of past online experiments. It would be interesting to see how it carries over to settings where far fewer experiments are available, such as physical retail or service-based industries.

Additionally, the paper does not delve deeply into the ethical considerations of A/B testing, such as the potential for biased or manipulative practices. As A/B testing becomes more widespread, it will be crucial for researchers and practitioners to address these concerns and ensure that the use of these techniques aligns with ethical principles.

Conclusion

This paper offers a practical way to evaluate A/B-testing metrics: use the record of past experiments to measure each metric's error rates and statistical power, rather than choosing supporting metrics by intuition. The authors show the approach working at scale on two short-video platforms.

For teams whose North Star metric often moves insignificantly, the work is a useful reference for deciding which supporting metrics can be relied on when making a call.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Powerful A/B-Testing Metrics and Where to Find Them

Olivier Jeunen, Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko

Online controlled experiments, colloquially known as A/B-tests, are the bread and butter of real-world recommender system evaluation. Typically, end-users are randomly assigned some system variant, and a plethora of metrics are then tracked, collected, and aggregated throughout the experiment. A North Star metric (e.g. long-term growth or revenue) is used to assess which system variant should be deemed superior. As a result, most collected metrics are supporting in nature, and serve to either (i) provide an understanding of how the experiment impacts user experience, or (ii) allow for confident decision-making when the North Star metric moves insignificantly (i.e. a false negative or type-II error). The latter is not straightforward: suppose a treatment variant leads to fewer but longer sessions, with more views but fewer engagements; should this be considered a positive or negative outcome? The question then becomes: how do we assess a supporting metric's utility when it comes to decision-making using A/B-testing? Online platforms typically run dozens of experiments at any given time. This provides a wealth of information about interventions and treatment effects that can be used to evaluate metrics' utility for online evaluation. We propose to collect this information and leverage it to quantify type-I, type-II, and type-III errors for the metrics of interest, alongside a distribution of measurements of their statistical power (e.g. $z$-scores and $p$-values). We present results and insights from building this pipeline at scale for two large-scale short-video platforms: ShareChat and Moj; leveraging hundreds of past experiments to find online metrics with high statistical power.

7/31/2024

Learning Metrics that Maximise Power for Accelerated A/B-Tests

Olivier Jeunen, Aleksei Ustimenko

Online controlled experiments are a crucial tool to allow for confident decision-making in technology companies. A North Star metric is defined (such as long-term revenue or user retention), and system variants that statistically significantly improve on this metric in an A/B-test can be considered superior. North Star metrics are typically delayed and insensitive. As a result, the cost of experimentation is high: experiments need to run for a long time, and even then, type-II errors (i.e. false negatives) are prevalent. We propose to tackle this by learning metrics from short-term signals that directly maximise the statistical power they harness with respect to the North Star. We show that existing approaches are prone to overfitting, in that higher average metric sensitivity does not imply improved type-II errors, and propose to instead minimise the $p$-values a metric would have produced on a log of past experiments. We collect such datasets from two social media applications with over 160 million Monthly Active Users each, totalling over 153 A/B-pairs. Empirical results show that we are able to increase statistical power by up to 78% when using our learnt metrics stand-alone, and by up to 210% when used in tandem with the North Star. Alternatively, we can obtain constant statistical power at a sample size that is down to 12% of what the North Star requires, significantly reducing the cost of experimentation.

6/14/2024

Large-Scale Metric Computation in Online Controlled Experiment Platform

Tao Xiong, Yong Wang

Online controlled experiment (also called A/B test or experiment) is the most important tool for decision-making at a wide range of data-driven companies like Microsoft, Google, Meta, etc. Metric computation is the core procedure for reaching a conclusion during an experiment. With the growth of experiments and metrics in an experiment platform, computing metrics efficiently at scale becomes a non-trivial challenge. This work shows how metric computation in WeChat experiment platform can be done efficiently using bit-sliced index (BSI) arithmetic. This approach has been implemented in a real world system and the performance results are presented, showing that the BSI arithmetic approach is very suitable for large-scale metric computation scenarios.

8/26/2024
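
The abstract above mentions bit-sliced index (BSI) arithmetic without showing what it looks like. As a rough illustration of the general idea only (not the WeChat platform's implementation), the sketch below stores non-negative integer metric values as one bitmap per bit position and sums the values for a group of users with bitwise AND and population counts. Python integers stand in for bitmaps, and `build_bsi` and `bsi_sum` are hypothetical names.

```python
def build_bsi(values):
    """Build bit slices: slice k has bit i set when bit k of values[i] is set.

    Assumes non-negative integer values; Python ints act as bitmaps.
    """
    n_bits = max(values).bit_length() if values else 0
    slices = [0] * n_bits
    for i, v in enumerate(values):
        for k in range(n_bits):
            if (v >> k) & 1:
                slices[k] |= 1 << i
    return slices


def bsi_sum(slices, mask):
    """Sum the values of the users selected by `mask` (bit i set = user i)."""
    # Each slice contributes 2^k times the number of selected users with bit k set.
    return sum((1 << k) * bin(s & mask).count("1") for k, s in enumerate(slices))


# Made-up per-user metric values; users 0, 2, 4 are in treatment.
values = [3, 0, 5, 2, 7, 1]
slices = build_bsi(values)
treatment_mask = 0b010101
control_mask = 0b101010

print(bsi_sum(slices, treatment_mask), bsi_sum(slices, control_mask))
```

The per-group sum needs only one AND and one population count per bit position, regardless of how many users are in the group, which is the property that makes this layout attractive for computing many metrics over many experiment groups.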

Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests

Pallavi Basu, Ron Berman

A/B testers conducting large-scale tests prioritize lifts and want to be able to control false rejections of the null. This work develops a decision-theoretic framework for maximizing profits subject to false discovery rate (FDR) control. We build an empirical Bayes solution for the problem via the greedy knapsack approach. We derive an oracle rule based on ranking the ratio of expected lifts and the cost of wrong rejections using the local false discovery rate (lfdr) statistic. Our oracle decision rule is valid and optimal for large-scale tests. Further, we establish asymptotic validity for the data-driven procedure and demonstrate finite-sample validity in experimental studies. We also demonstrate the merit of the proposed method over other FDR control methods. Finally, we discuss an application to actual Optimizely experiments.

7/2/2024