Donut’s Kindle Notes & Highlights for Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing

Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing

by Ron Kohavi

Read between October 18 - November 11, 2021

54%

If the only reason to run a controlled experiment is to measure, we could run the experiment at the maximum power ramp (MPR), which often means a 50% traffic allocation to the Treatment, providing the highest statistical sensitivity, assuming our goal is to ramp that Treatment to 100%.
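
Note on the highlight above: a minimal sketch of why a 50% allocation tends to give the highest sensitivity. It assumes all experiment traffic is split between one Treatment and one Control with equal per-user variance, so the standard error of the difference in means is proportional to sqrt(1/p + 1/(1 - p)). The function name `relative_std_error` is made up for illustration and is not from the book.

```python
def relative_std_error(p: float) -> float:
    """Standard error of (Treatment mean - Control mean), up to a constant factor.

    Assumes a fraction p of traffic goes to Treatment, 1 - p to Control,
    and both groups have the same per-user variance.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("allocation p must be strictly between 0 and 1")
    return (1.0 / p + 1.0 / (1.0 - p)) ** 0.5


if __name__ == "__main__":
    # Smaller is better: the error shrinks as the split approaches 50/50.
    for p in (0.01, 0.05, 0.10, 0.25, 0.50):
        print(f"Treatment allocation {p:.0%}: relative std error {relative_std_error(p):.2f}")
```

Under these assumptions the value is smallest at p = 0.5, which is one way to read "maximum power ramp".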

54%

Figure 15.1 Four phases of the ramping process

54%

1. Create “rings” of testing populations and gradually expose the Treatment to successive rings to mitigate risk.

54%

a) Whitelisted individuals, such as the team implementing the new feature. You can get verbatim feedback from your team members.

b) Company employees, as they are typically more forgiving if there are bad bugs.

c) Beta users or insiders who tend to be vocal and loyal, who want to see new features sooner, and who are usually willing to give feedback.

d) Data centers to isolate interactions that can be challenging to identify, such as memory leaks (death by slow leak) or other inappropriate use of resources.

54%

2. Automatically dialing up traffic until it reaches the desired allocation.

55%

3. Producing real-time or near-real-time measurements on key guardrail metrics.
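
Note on steps 2 and 3 above: a rough sketch of how automatic dial-up and guardrail monitoring might fit together. The step schedule, the `guardrails_ok` callback, and the rollback-to-zero behavior are all assumptions made for illustration; the highlights do not describe a specific implementation.

```python
from typing import Callable, Sequence


def ramp(
    steps: Sequence[float],
    guardrails_ok: Callable[[float], bool],
    target: float = 0.5,
) -> float:
    """Dial Treatment traffic up through `steps` until `target` is reached.

    `guardrails_ok(allocation)` is a hypothetical hook that would read
    near-real-time guardrail metrics at the given allocation.
    Returns the final allocation, or 0.0 if a guardrail check fails.
    """
    current = 0.0
    for step in steps:
        if step > target:
            break  # never ramp past the desired allocation
        current = step
        if not guardrails_ok(current):
            return 0.0  # assumed policy: shut the Treatment off on any failure
    return current


if __name__ == "__main__":
    schedule = [0.01, 0.05, 0.10, 0.25, 0.50]
    print(ramp(schedule, guardrails_ok=lambda allocation: True))
```

In practice each step would also wait long enough for the guardrail metrics to accumulate data before moving on; that delay is left out here.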

55%

MPR is the ramp phase dedicated to measuring the impact of the experiment.

55%

This ramp phase must be long enough to capture time-dependent factors.

55%

By the time an experiment is past the MPR phase, there should be no concerns regarding end-user impact. Optimally, operational concerns should also be resolved in earlier ramps.

55%

We have seen increasing popularity in long-term holdouts, also called holdbacks, where certain users do not get exposed to Treatment for a long time.

58%

Two-sample t-tests are the most common statistical significance tests for determining whether the difference we see between Treatment and Control is real or just noise (Student 1908; Wasserman 2004). Two-sample t-tests look at the size of the difference between the two means relative to the variance. The significance of the difference is represented by the p-value. The lower the p-value, the stronger the evidence that the Treatment is different from the Control.

58%

any difference with a p-value smaller than 0.05 is considered “statistically significant,”
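
Note on the two highlights above: a small sketch of a two-sample test on made-up data, using only the Python standard library. It uses the unequal-variance (Welch) form of the statistic and, as a simplifying assumption, a normal approximation for the p-value, which is reasonable only for large samples; an exact t-test would use the t distribution instead. The function name and the sample data are illustrative.

```python
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def two_sample_test(treatment: Sequence[float], control: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, two-sided p-value) for Treatment vs. Control.

    The statistic is the difference in means relative to its standard error.
    The p-value uses a normal approximation (assumes large samples).
    """
    delta = mean(treatment) - mean(control)
    std_error = (variance(treatment) / len(treatment) + variance(control) / len(control)) ** 0.5
    t_stat = delta / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


if __name__ == "__main__":
    control = [10.0, 11.0, 9.0, 10.5, 9.5] * 20      # made-up metric values
    treatment = [x + 1.0 for x in control]           # made-up uplift of 1.0
    t_stat, p_value = two_sample_test(treatment, control)
    verdict = "statistically significant" if p_value < 0.05 else "not statistically significant"
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}: {verdict} at the 0.05 level")
```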
