Large-Scale Metric Computation in Online Controlled Experiment Platform

Read original: arXiv:2405.08411 - Published 8/26/2024 by Tao Xiong, Yong Wang

Overview

Online controlled experiments, also known as A/B tests, are a crucial tool for data-driven decision-making at major companies.
Metric computation is the core process for reaching conclusions during an experiment.
As experiments and metrics grow in scale, efficiently computing metrics becomes a significant challenge.

Plain English Explanation

When companies like Microsoft, Google, and Meta want to test new features or ideas, they often use online controlled experiments, or A/B tests. These experiments allow them to compare different versions and see which performs better. The key to these experiments is measuring specific metrics, like how many users click on a button or how long they spend on a page.

As companies run more and more of these experiments, and track more and more metrics, the amount of data and calculations involved can become overwhelming. The paper presented here shows how the WeChat experiment platform is able to efficiently compute these metrics using a technique called bit-sliced indexing. This approach allows them to handle the large scale of modern experiment platforms more effectively.

Technical Explanation

The paper describes how the WeChat experiment platform uses bit-sliced index (BSI) arithmetic to efficiently compute metrics during online controlled experiments. BSI is a data structure and set of operations that allows for fast, scalable metric calculation, even as the number of experiments and metrics grows.
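To make the data structure concrete, here is a minimal sketch of a bit-sliced index in Python. It is not the WeChat implementation: the names `build_bsi` and `get_value`, the use of plain Python integers as bitmaps, and the assumption that user IDs are small non-negative integers are all illustrative choices. Slice i is a bitmap in which bit u is set when bit i of user u's metric value is 1.

```python
from typing import Dict, List


def build_bsi(values: Dict[int, int]) -> List[int]:
    """Build bit slices from {user_id: metric_value} (hypothetical helper).

    Slice i is a bitmap whose bit u is set when bit i of values[u] is 1.
    Python ints act as arbitrary-length bitmaps in this sketch.
    """
    if any(v < 0 for v in values.values()):
        raise ValueError("this sketch handles non-negative integers only")
    n_bits = max((v.bit_length() for v in values.values()), default=0)
    slices = [0] * n_bits
    for user, value in values.items():
        for i in range(n_bits):
            if (value >> i) & 1:
                slices[i] |= 1 << user
    return slices


def get_value(slices: List[int], user: int) -> int:
    """Reassemble one user's value by reading bit `user` of every slice."""
    return sum(((s >> user) & 1) << i for i, s in enumerate(slices))


# Toy data: clicks per user (IDs 0..3).
clicks = {0: 5, 1: 3, 2: 0, 3: 6}
bsi = build_bsi(clicks)
print(bsi)                # one bitmap per bit position of the metric
print(get_value(bsi, 3))  # user 3's click count, read back from the slices
```

A real system would typically use compressed bitmaps instead of Python integers, and would also track which users exist, since a value of zero leaves no bits set in this representation.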

The key idea is to store metric data in a way that allows certain common calculations to be performed very efficiently using bitwise operations. This avoids the need for more costly database queries or other conventional approaches. The paper provides details on the BSI data structure and the specific arithmetic operations used.
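As an illustration of how bitwise operations can replace row-by-row aggregation, the sketch below sums a metric over one experiment group. It relies on the standard BSI identity that the sum equals the total over slices of 2^i times the number of set bits in (slice i AND group bitmap). The function names, the toy data, and the use of Python integers as bitmaps are assumptions made for this example, not details taken from the paper.

```python
from typing import Dict, Iterable, List


def build_bsi(values: Dict[int, int]) -> List[int]:
    """Slice i has bit u set when bit i of values[u] is 1 (non-negative ints)."""
    n_bits = max((v.bit_length() for v in values.values()), default=0)
    return [
        sum(1 << u for u, v in values.items() if (v >> i) & 1)
        for i in range(n_bits)
    ]


def to_bitmap(users: Iterable[int]) -> int:
    """Encode a set of user IDs as a bitmap (bit u set for each user u)."""
    bitmap = 0
    for u in users:
        bitmap |= 1 << u
    return bitmap


def popcount(bitmap: int) -> int:
    """Count set bits; works on any Python 3 version."""
    return bin(bitmap).count("1")


def bsi_sum(slices: List[int], group: int) -> int:
    """Sum the metric over users in `group` using only AND and popcount."""
    return sum(popcount(s & group) << i for i, s in enumerate(slices))


# Toy experiment: clicks per user, with users split into two groups.
clicks = {0: 5, 1: 3, 2: 0, 3: 6}
slices = build_bsi(clicks)
treatment = to_bitmap([0, 3])
control = to_bitmap([1, 2])

treatment_total = bsi_sum(slices, treatment)  # 5 + 6
control_total = bsi_sum(slices, control)      # 3 + 0
print(treatment_total / popcount(treatment))  # mean clicks, treatment
print(control_total / popcount(control))      # mean clicks, control
```

The work per group is proportional to the number of bit slices rather than the number of users touched one at a time, which is the property that makes this layout attractive when many experiments share the same metric data.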

The authors implemented this BSI-based metric computation in WeChat's production experiment platform and present performance results. Based on those results, they conclude that the BSI arithmetic approach is well suited to large-scale metric computation.

Critical Analysis

The paper offers a practical solution to the challenge of metric computation at scale for online experiments. Bit-sliced indexing appears well suited to the problem domain, and the approach is backed by performance results from a deployed system.

That said, the paper does not delve into potential limitations or caveats of the approach. For example, it's not clear how the BSI technique would handle more complex metric calculations beyond the basic ones described. There may also be tradeoffs in terms of storage overhead or the ability to perform ad-hoc queries that are not optimized for the BSI structure.

Additionally, the paper is focused on the WeChat platform specifically. While the general principles may apply to other experiment platforms, further research would be needed to understand how well the BSI approach generalizes, especially in the context of partial network information or other unique experiment scenarios.

Conclusion

This paper presents an innovative solution to the challenge of efficiently computing metrics for large-scale online controlled experiments. The use of bit-sliced indexing allows the WeChat experiment platform to handle the growing volume of experiments and metrics in a scalable and performant way.

While the specific implementation details may be most relevant to the WeChat use case, the general principle of using specialized data structures and arithmetic operations to optimize metric computation could apply to a wider range of experiment platforms. As online experimentation continues to grow in importance for data-driven decision-making, techniques like the one described in this paper are likely to become increasingly valuable.

Related Papers

Large-Scale Metric Computation in Online Controlled Experiment Platform

Tao Xiong, Yong Wang

Online controlled experiment (also called A/B test or experiment) is the most important tool for decision-making at a wide range of data-driven companies like Microsoft, Google, Meta, etc. Metric computation is the core procedure for reaching a conclusion during an experiment. With the growth of experiments and metrics in an experiment platform, computing metrics efficiently at scale becomes a non-trivial challenge. This work shows how metric computation in WeChat experiment platform can be done efficiently using bit-sliced index (BSI) arithmetic. This approach has been implemented in a real world system and the performance results are presented, showing that the BSI arithmetic approach is very suitable for large-scale metric computation scenarios.

8/26/2024

Powerful A/B-Testing Metrics and Where to Find Them

Olivier Jeunen, Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko

Online controlled experiments, colloquially known as A/B-tests, are the bread and butter of real-world recommender system evaluation. Typically, end-users are randomly assigned some system variant, and a plethora of metrics are then tracked, collected, and aggregated throughout the experiment. A North Star metric (e.g. long-term growth or revenue) is used to assess which system variant should be deemed superior. As a result, most collected metrics are supporting in nature, and serve to either (i) provide an understanding of how the experiment impacts user experience, or (ii) allow for confident decision-making when the North Star metric moves insignificantly (i.e. a false negative or type-II error). The latter is not straightforward: suppose a treatment variant leads to fewer but longer sessions, with more views but fewer engagements; should this be considered a positive or negative outcome? The question then becomes: how do we assess a supporting metric's utility when it comes to decision-making using A/B-testing? Online platforms typically run dozens of experiments at any given time. This provides a wealth of information about interventions and treatment effects that can be used to evaluate metrics' utility for online evaluation. We propose to collect this information and leverage it to quantify type-I, type-II, and type-III errors for the metrics of interest, alongside a distribution of measurements of their statistical power (e.g. $z$-scores and $p$-values). We present results and insights from building this pipeline at scale for two large-scale short-video platforms: ShareChat and Moj; leveraging hundreds of past experiments to find online metrics with high statistical power.

7/31/2024

Learning Metrics that Maximise Power for Accelerated A/B-Tests

Olivier Jeunen, Aleksei Ustimenko

Online controlled experiments are a crucial tool to allow for confident decision-making in technology companies. A North Star metric is defined (such as long-term revenue or user retention), and system variants that statistically significantly improve on this metric in an A/B-test can be considered superior. North Star metrics are typically delayed and insensitive. As a result, the cost of experimentation is high: experiments need to run for a long time, and even then, type-II errors (i.e. false negatives) are prevalent. We propose to tackle this by learning metrics from short-term signals that directly maximise the statistical power they harness with respect to the North Star. We show that existing approaches are prone to overfitting, in that higher average metric sensitivity does not imply improved type-II errors, and propose to instead minimise the $p$-values a metric would have produced on a log of past experiments. We collect such datasets from two social media applications with over 160 million Monthly Active Users each, totalling over 153 A/B-pairs. Empirical results show that we are able to increase statistical power by up to 78% when using our learnt metrics stand-alone, and by up to 210% when used in tandem with the North Star. Alternatively, we can obtain constant statistical power at a sample size that is down to 12% of what the North Star requires, significantly reducing the cost of experimentation.

6/14/2024

Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests

Pallavi Basu, Ron Berman

A/B testers conducting large-scale tests prioritize lifts and want to be able to control false rejections of the null. This work develops a decision-theoretic framework for maximizing profits subject to false discovery rate (FDR) control. We build an empirical Bayes solution for the problem via the greedy knapsack approach. We derive an oracle rule based on ranking the ratio of expected lifts and the cost of wrong rejections using the local false discovery rate (lfdr) statistic. Our oracle decision rule is valid and optimal for large-scale tests. Further, we establish asymptotic validity for the data-driven procedure and demonstrate finite-sample validity in experimental studies. We also demonstrate the merit of the proposed method over other FDR control methods. Finally, we discuss an application to actual Optimizely experiments.

7/2/2024