Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing
11%
Bing’s revenue improved so much during the time that each millisecond in improved performance was worth more than in the past; every four milliseconds of improvement funded an engineer for a year!
15%
Statistical power is the probability of detecting a meaningful difference between the variants when there really is one (statistically, reject the null when there is a difference).
15%
1. What is the randomization unit?
2. What population of randomization units do we want to target?
3. How large (size) does our experiment need to be?
4. How long do we run the experiment?
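A minimal sketch of how the "how large" question is typically answered for a mean-difference metric, assuming a standard two-sided z-test formulation; the 0.05 significance level, 80% power default, and the example numbers are illustrative, not from the book.

```python
from scipy.stats import norm

def sample_size_per_variant(sigma, delta, alpha=0.05, power=0.80):
    """Users needed in EACH variant to detect an absolute difference `delta`
    in a metric with standard deviation `sigma`, two-sided test at level `alpha`."""
    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # ~0.84 for 80% power
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

# Illustrative numbers: a revenue/user metric with std dev 30, aiming to
# detect a 0.3 absolute change.
print(round(sample_size_per_variant(sigma=30, delta=0.3)))  # ~157,000 users per variant
```

Note that the coefficient 2(z_{1-α/2} + z_β)² is roughly 16 for 80% power at α = 0.05, which is where the familiar n ≈ 16σ²/δ² rule of thumb comes from.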
16%
In general, overpowering an experiment is fine and even recommended, as sometimes we need to examine segments (e.g., geographic region or platform) and to ensure that the experiment has sufficient power to detect changes on several key metrics.
18%
Here are some incorrect statements and explanations from A Dirty Dozen: Twelve P-Value Misconceptions (Goodman 2008):
18%
1. Use sequential tests with always valid p-values, as suggested by Johari et al. (2017), or a Bayesian testing framework.
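A rough sketch of the always-valid p-value idea behind the mixture sequential probability ratio test (mSPRT) in Johari et al. (2017), for the simplified case of a stream of per-unit treatment-minus-control differences with known variance. The function name, the simulated data, and the choice of mixing variance tau2 are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def always_valid_p_values(differences, sigma2, tau2=1.0):
    """Sequential p-values for H0: mean difference = 0, computed from a stream
    of per-unit treatment-minus-control differences.

    sigma2: (assumed known) variance of a single difference
    tau2:   variance of the normal mixing distribution over the true effect

    The p-value sequence is non-increasing and remains valid however often
    you peek and whenever you stop (Johari et al. 2017).
    """
    p_values = []
    p = 1.0
    total = 0.0
    for n, d in enumerate(differences, start=1):
        total += d
        mean = total / n
        # Mixture likelihood ratio of H1 (effect ~ N(0, tau2)) vs H0 (effect = 0)
        lam = math.sqrt(sigma2 / (sigma2 + n * tau2)) * math.exp(
            n ** 2 * tau2 * mean ** 2 / (2 * sigma2 * (sigma2 + n * tau2))
        )
        p = min(p, 1.0 / lam)
        p_values.append(p)
    return p_values

# Illustrative usage with simulated data (true effect = 0.2, sigma = 1):
random.seed(0)
diffs = [random.gauss(0.2, 1.0) for _ in range(5000)]
print(always_valid_p_values(diffs, sigma2=1.0)[-1])  # tiny p-value despite peeking at every step
```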
19%
With large numbers, a ratio smaller than 0.99 or larger than 1.01 for a design that called for 1.0 more than likely indicates a serious issue.
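The ratio guideline above is a quick screen; a common way to confirm a sample ratio mismatch (SRM) is a chi-squared goodness-of-fit test against the designed split. A minimal sketch with illustrative counts, assuming scipy is available:

```python
from scipy.stats import chisquare

# Observed users per variant vs. the designed 50/50 split (illustrative numbers).
control, treatment = 821_588, 815_482
total = control + treatment
stat, p_value = chisquare([control, treatment], f_exp=[total / 2, total / 2])

# The ratio looks close to 1.0, but with ~1.6M users the tiny p-value
# flags a serious sample ratio mismatch.
print(treatment / control)  # ~0.993
print(p_value)              # ~2e-6, far below a typical SRM alerting threshold
```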
22%
Ideally, segmenting should be done only by values that are determined prior to the experiment, so that the Treatment could not cause users to change segments, though in practice restricting segments this way may be hard for some use cases.
24%
From that perspective, full transparency on the experiment impact is critical. Here are some ways we found to achieve this:
24%
Figure 4.1 Experimentation Growth over the years for Bing, Google, LinkedIn, and Office. Today, Google, LinkedIn, and Microsoft are at a run rate of over 20,000 controlled experiments/year,
26%
The platform needs some interface and/or tools to easily manage many experiments and their multiple iterations. Functionalities should include:
- Writing, editing, and saving draft experiment specifications.
- Comparing the draft iteration of an experiment with the current (running) iteration.
- Viewing the history or timeline of an experiment (even if it is no longer running).
- Automatically assigning generated experiment IDs, variants, and iterations and adding them to the experiment specification. These IDs are needed in the experiment instrumentation (discussed later in this chapter).
- Validating …
26%
Beyond these basic checks, especially in the Fly phase when experiments are being run at scale, the platform also needs to support:
- Automation of how experiments are released and ramped up (see Chapter 15 for more detail)
- Near-real-time monitoring and alerting, to catch bad experiments early
- Automated detection and shutdown of bad experiments.
These increase the safety of the experiments.
27%
The third architecture removes even the getVariant() call. Instead, early in the flow, variant assignment is done, and a configuration with the variant and all parameter values for that variant and for that user is passed down through the remaining flow.
buttonColor = config.getParam("buttonColor")
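A toy sketch of this third architecture, assuming a Python-style service: variant assignment happens once, up front, producing a config of parameter values that is passed down, and downstream code only reads parameters and never asks which variant it is in. The class, function, and parameter names here are illustrative, not from the book.

```python
import hashlib

# Parameter values per variant for one hypothetical experiment.
EXPERIMENT_PARAMS = {
    "control":   {"buttonColor": "blue",  "resultsPerPage": 10},
    "treatment": {"buttonColor": "green", "resultsPerPage": 10},
}
DEFAULTS = {"buttonColor": "blue", "resultsPerPage": 10}

class Config:
    """Holds the resolved parameter values for one user/request."""
    def __init__(self, params):
        self._params = params
    def get_param(self, name):
        return self._params.get(name, DEFAULTS[name])

def assign_config(user_id, experiment_name="button_color_exp"):
    """Done once, early in the request flow: hash the user into a variant
    and materialize all parameter values for that variant."""
    bucket = int(hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest(), 16) % 100
    variant = "treatment" if bucket < 50 else "control"
    return Config({**DEFAULTS, **EXPERIMENT_PARAMS[variant]})

# Downstream code never calls getVariant(); it only reads parameters.
config = assign_config(user_id="user-42")
button_color = config.get_param("buttonColor")
print(button_color)
```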
27%
The third architecture moves variant assignment early, so handling triggering is more challenging. However, it can also be more performant: as a system grows to have hundreds to thousands of parameters, even though any single experiment likely affects only a few of them, optimizing parameter handling, perhaps with caches, becomes critical from a performance perspective.
27%
Google shifted from the first architecture to the third based on a combination of performance reasons as well as the technical debt and the challenges of reconciling code paths when it came time to merge back into a single path to make future changes easier.
27%
To manage the concurrency, LinkedIn, Bing, and Google all started with manual methods (at LinkedIn, teams would negotiate traffic “ranges” using e-mails; at Bing, it was managed by a program manager, whose office was usually packed with people begging for experimental traffic; while at Google, it started with e-mail and instant messaging negotiation, before moving to a program manager). However, the manual methods do not scale, so all three companies shifted to programmatic assignment over time.
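One simple way to picture programmatic assignment (a hypothetical sketch, not any of these companies' systems): treat traffic as a fixed pool of hash buckets within a layer and let the platform hand out non-overlapping bucket ranges to experiments, instead of negotiating ranges by e-mail.

```python
class TrafficLayer:
    """Hands out non-overlapping bucket ranges from a fixed pool of hash buckets."""
    def __init__(self, total_buckets=1000):
        self.total_buckets = total_buckets
        self.next_free = 0
        self.allocations = {}  # experiment -> (start, end), half-open range

    def allocate(self, experiment, percent):
        buckets = int(self.total_buckets * percent / 100)
        if self.next_free + buckets > self.total_buckets:
            raise ValueError(f"Not enough free traffic for {experiment}")
        self.allocations[experiment] = (self.next_free, self.next_free + buckets)
        self.next_free += buckets
        return self.allocations[experiment]

layer = TrafficLayer()
print(layer.allocate("new_ranker", 20))        # (0, 200)
print(layer.allocate("button_color_exp", 10))  # (200, 300)
```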
28%
That said, a factorial platform design might be preferred if the reduction in statistical power from splitting up traffic outweighs the potential concern about interactions.
28%
This requires computing metrics (e.g., OEC, guardrail metrics, quality metrics) by segment (e.g., country, language, device/platform), computing p-values/confidence intervals, and running trustworthiness checks, such as the SRM check.
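A minimal sketch of that analysis step for one metric, assuming per-user experiment data in a pandas DataFrame with columns variant, country, and revenue (all names illustrative): compute the delta, a Welch t-test p-value, and a 95% confidence interval overall and per segment.

```python
import numpy as np
import pandas as pd
from scipy import stats

def analyze(df, metric="revenue", segment=None):
    """Mean difference (treatment - control), 95% CI, and p-value,
    optionally broken down by a segment column."""
    rows = []
    groups = df.groupby(segment) if segment else [("all", df)]
    for name, g in groups:
        t = g.loc[g.variant == "treatment", metric]
        c = g.loc[g.variant == "control", metric]
        delta = t.mean() - c.mean()
        se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
        _, p = stats.ttest_ind(t, c, equal_var=False)
        rows.append({"segment": name, "delta": delta,
                     "ci_low": delta - 1.96 * se, "ci_high": delta + 1.96 * se,
                     "p_value": p})
    return pd.DataFrame(rows)

# Illustrative usage with simulated data:
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "variant": rng.choice(["control", "treatment"], n),
    "country": rng.choice(["US", "DE", "BR"], n),
})
df["revenue"] = rng.exponential(5, n) + (df.variant == "treatment") * 0.2

print(analyze(df))                     # overall result
print(analyze(df, segment="country"))  # segmented results
```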
28%
While education can help, options in the tool to use p-value thresholds smaller than the standard 0.05 value are effective. Lower thresholds allow experimenters to quickly filter to the most significant metrics (Xu et al. 2015).
28%
Visualization tools are a great gateway for accessing institutional memory to capture what was experimented on, why the decision was made, and the successes and failures that lead to knowledge discovery and learning.
28%
For example, through mining historical experiments, you can run a meta-analysis on which kinds of experiments tend to move certain metrics, and which metrics tend to move together (beyond their natural correlation).
29%
At Amazon, a 100 msec slowdown experiment decreased sales by 1%.
29%
These are blocking snippets that slow the page significantly because they require a roundtrip to the snippet provider and transfer the JavaScript, which is typically tens of kilobytes (Schrijvers 2017, Optimizely 2018b). Putting the snippet lower on the page results in page flashing. Based on latency experiment results, any increase in goal metrics might be offset by the cost of the latency increase.
31%
Finally, there is a question of whether speedup is more important on the first page or later pages in the session. Some speedup techniques (e.g., caching of JavaScript) can improve the performance of later pages in a session. Given the above factors, slowdowns of 100 msec and 250 msec were determined to be reasonable choices by Bing.
31%
While speed matters a lot, we have also seen some results we believe are overstated. In a Web 2.0 talk by Marissa Mayer, then at Google, she described an experiment where Google increased the number of search results on the Search Engine Result Page (SERP) from ten to thirty (Linden 2006). She claimed that traffic and revenue from Google searchers in the experimental group dropped by 20%. Her explanation? The page took half a second more to generate. Performance is a critical factor, but multiple factors were changed, and we suspect that the performance only accounts for a small percentage of …