Learning Metrics that Maximise Power for Accelerated A/B-Tests

Read original: arXiv:2402.03915 - Published 6/14/2024 by Olivier Jeunen, Aleksei Ustimenko

Overview

• This paper introduces an approach for learning metrics that maximize the statistical power of A/B tests, the standard technique for evaluating changes to online systems, so that experiments can reach conclusions sooner.

• The researchers propose a method that learns a metric from short-term signals, using a log of past experiments, and optimizes it to detect small but meaningful differences between two conditions more effectively than the delayed and insensitive North Star metric.

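• This can lead to faster and more reliable A/B testing. The authors report increases in statistical power of up to 78% when the learnt metrics are used stand-alone and up to 210% when they are used in tandem with the North Star, or the same power at a sample size down to 12% of what the North Star requires.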

Plain English Explanation

When companies want to test changes to their online products or services, they often use A/B testing. This involves randomly showing different versions (A and B) to users and measuring how they respond. The goal is to determine which version performs better.
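As a minimal sketch of the comparison described above, the snippet below runs a standard two-sample z-test on a per-user outcome. The choice of test, the function name `two_sample_z_test`, and the simulated data are assumptions made for illustration; the summary does not specify which test the authors use.

```python
import math
import random
from statistics import NormalDist, mean, variance

def two_sample_z_test(a, b):
    """Return (z, two-sided p-value) for the difference in means b - a."""
    # Standard error of the difference between two independent sample means.
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(b) - mean(a)) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

# Simulated outcomes: version B shifts the mean slightly (hypothetical numbers).
rng = random.Random(0)
control = [rng.gauss(10.0, 2.0) for _ in range(5000)]
treatment = [rng.gauss(10.3, 2.0) for _ in range(5000)]

z, p = two_sample_z_test(control, treatment)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the observed difference is unlikely under the hypothesis that A and B perform the same.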

The paper "Large-Scale Metric Computation in Online Controlled Experiment Platform" explains how to compute the metrics needed for A/B testing efficiently at scale. The paper "Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies" discusses the challenges of interpreting those metrics.

In this new paper, the researchers go a step further. They develop a way to actually learn the best metric to use for A/B testing. Traditional metrics may not be optimized to detect small but important differences between the two versions. The researchers' approach learns a custom metric function that is better suited to the specific A/B test, making it more powerful and efficient.

This could allow companies to run A/B tests faster and make decisions about product changes more quickly. It has the potential to accelerate the pace of innovation and product improvement.
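To see why a more sensitive metric shortens experiments, the sketch below uses the textbook sample-size approximation for a two-sided, two-sample z-test. It is a general statistical relation, not a calculation from the paper, and the function name `required_n_per_arm` and the effect sizes are hypothetical.

```python
from statistics import NormalDist

def required_n_per_arm(effect_size, alpha=0.05, power=0.8):
    """Approximate users per variant for a two-sided two-sample z-test.

    effect_size is the standardised difference (mean difference / std dev).
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return 2.0 * (z_alpha + z_beta) ** 2 / effect_size ** 2

baseline = required_n_per_arm(0.01)    # hypothetical insensitive metric
sensitive = required_n_per_arm(0.02)   # hypothetical metric, twice as sensitive
print(round(baseline), round(sensitive), sensitive / baseline)
```

Because the required sample size scales with the inverse square of the standardised effect, doubling a metric's sensitivity cuts the number of users needed to roughly a quarter.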

Technical Explanation

The key innovation in this paper is a method for learning the metric function used in A/B testing, rather than relying on pre-defined metrics. The researchers formulate this as an optimization problem, where the goal is to find a metric that maximizes the statistical power to detect meaningful differences between the two conditions.
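One way to make this objective concrete, under assumptions that go beyond this summary: suppose the learnt metric is a linear combination of short-term signals with weights $w$, and each past experiment $i$ in the log is summarised by the vector of observed treatment-minus-control differences $\delta_i$ and the covariance $\Sigma_i$ of that estimate. If the better variant of each pair is known and labelled as the treatment, the $z$-score and one-sided $p$-value of the learnt metric, together with an objective in the spirit of the abstract's "minimise the $p$-values", could be written as:

$$
z_i(w) = \frac{w^\top \delta_i}{\sqrt{w^\top \Sigma_i w}}, \qquad
p_i(w) = 1 - \Phi\big(z_i(w)\big), \qquad
w^\star = \arg\min_w \frac{1}{N} \sum_{i=1}^{N} p_i(w)
$$

Here $\Phi$ is the standard normal CDF and $N$ is the number of logged A/B-pairs. The notation is illustrative and not taken from the paper. Note that $z_i(w)$ does not change when $w$ is rescaled by a positive constant, so only the direction of $w$ matters.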

They propose using a flexible, parameterized form for the metric function, such as a neural network. This allows the metric to be adapted to the specific application, rather than using a one-size-fits-all approach. The metric parameters are then optimized using historical data from previous A/B tests.
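The following is a minimal sketch of fitting metric parameters on historical experiments, not the authors' implementation. It assumes a linear metric over short-term signals (a simplification of the flexible form mentioned above), that each past experiment is summarised by a vector of treatment-minus-control differences (`deltas`) and the covariance of that estimate (`covs`), and that the better variant is always labelled as treatment. All names and the synthetic data are hypothetical.

```python
import math
import numpy as np

def mean_p_value(w, deltas, covs):
    """Average one-sided p-value of the linear metric w over logged experiments."""
    z = (deltas @ w) / np.sqrt(np.einsum("i,eij,j->e", w, covs, w))
    cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    return float(np.mean(1.0 - cdf))

def grad_mean_p_value(w, deltas, covs):
    """Analytic gradient of mean_p_value with respect to w."""
    num = deltas @ w                                  # shape (E,)
    cov_w = np.einsum("eij,j->ei", covs, w)           # shape (E, d)
    den = np.sqrt(np.einsum("ei,i->e", cov_w, w))     # shape (E,)
    z = num / den
    dz = deltas / den[:, None] - (num / den**3)[:, None] * cov_w
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return -(pdf[:, None] * dz).mean(axis=0)

def learn_metric(deltas, covs, steps=1000, lr=0.2):
    """Gradient descent on the mean p-value, keeping w on the unit sphere."""
    d = deltas.shape[1]
    w = np.ones(d) / math.sqrt(d)
    for _ in range(steps):
        w = w - lr * grad_mean_p_value(w, deltas, covs)
        w = w / np.linalg.norm(w)   # z-scores do not depend on the scale of w
    return w

# Synthetic log of past experiments (hypothetical): signal 0 is informative,
# signal 1 weakly informative, signal 2 is pure noise.
rng = np.random.default_rng(0)
n_exp, n_sig = 40, 3
true_effect = np.array([0.2, 0.05, 0.0])
scale = rng.uniform(0.5, 1.5, size=n_exp)
covs = scale[:, None, None] * (0.01 * np.eye(n_sig))
noise = rng.standard_normal((n_exp, n_sig)) * np.sqrt(0.01 * scale)[:, None]
deltas = true_effect + noise

w0 = np.ones(n_sig) / math.sqrt(n_sig)   # equal-weight starting metric
w = learn_metric(deltas, covs)
print("equal weights:", mean_p_value(w0, deltas, covs))
print("learnt weights:", mean_p_value(w, deltas, covs))
```

The comparison at the end is made on the same experiments used for fitting, so it says nothing about how the learnt weights would behave on new experiments.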

The technical details involve casting this as a bi-level optimization problem, where the inner loop learns the optimal metric for a given A/B test setup, and the outer loop adjusts the metric function to work well across a range of test scenarios. The researchers demonstrate the effectiveness of their approach through both theoretical analysis and empirical evaluation on real-world data.

Critical Analysis

The researchers acknowledge several limitations and areas for future work. One key issue is the potential for overfitting the learned metric to the historical data, which could reduce its generalization performance. Techniques like cross-validation and explicit regularization may be needed to address this.
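A rough sketch of how cross-validation and regularization could be combined to check for this kind of overfitting is shown below. It is not the paper's procedure: the closed-form ridge learner `fit_weights`, the data layout (per-experiment difference vectors `deltas` with covariances `covs`), and the synthetic data are all assumptions for illustration. The key point is that folds are formed over whole experiments, so the metric is always scored on experiments it was not fitted on.

```python
import math
import numpy as np

def fit_weights(deltas, covs, ridge):
    """Hypothetical learner: solve (mean covariance + ridge * I) w = mean delta."""
    d = deltas.shape[1]
    return np.linalg.solve(covs.mean(axis=0) + ridge * np.eye(d), deltas.mean(axis=0))

def mean_p_value(w, deltas, covs):
    """Average one-sided p-value of the linear metric w over the given experiments."""
    z = (deltas @ w) / np.sqrt(np.einsum("i,eij,j->e", w, covs, w))
    # 1 - Phi(z) written with the complementary error function.
    return float(np.mean([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z]))

def cross_validated_p(deltas, covs, ridge, n_folds=5, seed=0):
    """Fit on some experiments, score on held-out ones, average over folds."""
    idx = np.random.default_rng(seed).permutation(len(deltas))
    scores = []
    for held_out in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, held_out)
        w = fit_weights(deltas[train], covs[train], ridge)
        scores.append(mean_p_value(w, deltas[held_out], covs[held_out]))
    return float(np.mean(scores))

# Synthetic log (hypothetical): two informative signals, two noise signals.
rng = np.random.default_rng(1)
n_exp = 30
variances = np.array([0.01, 0.02, 0.01, 0.04])
true_effect = np.array([0.2, 0.1, 0.0, 0.0])
covs = np.repeat(np.diag(variances)[None], n_exp, axis=0)
deltas = true_effect + np.sqrt(variances) * rng.standard_normal((n_exp, 4))

results = {ridge: cross_validated_p(deltas, covs, ridge) for ridge in (0.0, 0.01, 0.1)}
for ridge, score in results.items():
    print(f"ridge={ridge}: held-out mean p-value = {score:.4f}")
```

Comparing the held-out scores across `ridge` values gives a simple way to pick a regularization strength without rewarding weights that merely fit noise in the training experiments.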

Another limitation is the computational complexity of the bi-level optimization process, which could be challenging to scale to very large problems. The researchers suggest exploring approximate or more efficient optimization methods as a potential solution.

Additionally, the paper focuses on a single-outcome setting, but many real-world A/B tests involve multiple, potentially conflicting objectives. The paper "Multi-Objective Recommendation via Multivariate Policy Learning" discusses techniques for handling multiple objectives in a related context.

Finally, the learned metric function is treated as a "black box," which may make it difficult to interpret and understand. The paper "Measuring Model Variability Using Robust Non-Parametric Techniques" explores methods for better understanding the behavior of complex models. Incorporating such interpretability techniques could be a valuable extension of this work.

Conclusion

This paper presents a novel approach for learning optimized metrics for accelerated A/B testing, which can lead to faster and more reliable evaluation of product changes. By adapting the metric function to the specific test scenario, the method can detect small but meaningful differences more effectively than traditional, one-size-fits-all metrics.

While the technical implementation has limitations that require further research, learning custom metrics for A/B testing is a promising direction that could speed up experimentation and product development for online businesses and services.

This summary was produced with help from an AI and may contain inaccuracies; refer to the original paper (arXiv:2402.03915) for details.

Related Papers

Learning Metrics that Maximise Power for Accelerated A/B-Tests

Olivier Jeunen, Aleksei Ustimenko

Online controlled experiments are a crucial tool to allow for confident decision-making in technology companies. A North Star metric is defined (such as long-term revenue or user retention), and system variants that statistically significantly improve on this metric in an A/B-test can be considered superior. North Star metrics are typically delayed and insensitive. As a result, the cost of experimentation is high: experiments need to run for a long time, and even then, type-II errors (i.e. false negatives) are prevalent. We propose to tackle this by learning metrics from short-term signals that directly maximise the statistical power they harness with respect to the North Star. We show that existing approaches are prone to overfitting, in that higher average metric sensitivity does not imply improved type-II errors, and propose to instead minimise the $p$-values a metric would have produced on a log of past experiments. We collect such datasets from two social media applications with over 160 million Monthly Active Users each, totalling over 153 A/B-pairs. Empirical results show that we are able to increase statistical power by up to 78% when using our learnt metrics stand-alone, and by up to 210% when used in tandem with the North Star. Alternatively, we can obtain constant statistical power at a sample size that is down to 12% of what the North Star requires, significantly reducing the cost of experimentation.

6/14/2024

Powerful A/B-Testing Metrics and Where to Find Them

Olivier Jeunen, Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko

Online controlled experiments, colloquially known as A/B-tests, are the bread and butter of real-world recommender system evaluation. Typically, end-users are randomly assigned some system variant, and a plethora of metrics are then tracked, collected, and aggregated throughout the experiment. A North Star metric (e.g. long-term growth or revenue) is used to assess which system variant should be deemed superior. As a result, most collected metrics are supporting in nature, and serve to either (i) provide an understanding of how the experiment impacts user experience, or (ii) allow for confident decision-making when the North Star metric moves insignificantly (i.e. a false negative or type-II error). The latter is not straightforward: suppose a treatment variant leads to fewer but longer sessions, with more views but fewer engagements; should this be considered a positive or negative outcome? The question then becomes: how do we assess a supporting metric's utility when it comes to decision-making using A/B-testing? Online platforms typically run dozens of experiments at any given time. This provides a wealth of information about interventions and treatment effects that can be used to evaluate metrics' utility for online evaluation. We propose to collect this information and leverage it to quantify type-I, type-II, and type-III errors for the metrics of interest, alongside a distribution of measurements of their statistical power (e.g. $z$-scores and $p$-values). We present results and insights from building this pipeline at scale for two large-scale short-video platforms: ShareChat and Moj; leveraging hundreds of past experiments to find online metrics with high statistical power.

7/31/2024

Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests

Pallavi Basu, Ron Berman

A/B testers conducting large-scale tests prioritize lifts and want to be able to control false rejections of the null. This work develops a decision-theoretic framework for maximizing profits subject to false discovery rate (FDR) control. We build an empirical Bayes solution for the problem via the greedy knapsack approach. We derive an oracle rule based on ranking the ratio of expected lifts and the cost of wrong rejections using the local false discovery rate (lfdr) statistic. Our oracle decision rule is valid and optimal for large-scale tests. Further, we establish asymptotic validity for the data-driven procedure and demonstrate finite-sample validity in experimental studies. We also demonstrate the merit of the proposed method over other FDR control methods. Finally, we discuss an application to actual Optimizely experiments.

7/2/2024

Large-Scale Metric Computation in Online Controlled Experiment Platform

Tao Xiong, Yong Wang

Online controlled experiment (also called A/B test or experiment) is the most important tool for decision-making at a wide range of data-driven companies like Microsoft, Google, Meta, etc. Metric computation is the core procedure for reaching a conclusion during an experiment. With the growth of experiments and metrics in an experiment platform, computing metrics efficiently at scale becomes a non-trivial challenge. This work shows how metric computation in WeChat experiment platform can be done efficiently using bit-sliced index (BSI) arithmetic. This approach has been implemented in a real world system and the performance results are presented, showing that the BSI arithmetic approach is very suitable for large-scale metric computation scenarios.

8/26/2024