Kindle Notes & Highlights by Ron Kohavi
Read between October 18 - November 11, 2021
If the only reason to run a controlled experiment is to measure, we could run the experiment at the maximum power ramp (MPR), which often means a 50% traffic allocation to the Treatment. That allocation provides the highest statistical sensitivity, assuming our goal is to ramp that Treatment to 100%.
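To see why a 50% allocation is the most sensitive, here is a minimal sketch. It assumes one Treatment and one Control, equal variance in both, and a fixed total number of users. Under those assumptions the variance of the difference in means is proportional to 1/p + 1/(1 - p), where p is the Treatment share. The function name `relative_std_error` is illustrative and not from the book.

```python
import math


def relative_std_error(treatment_share: float) -> float:
    """Standard error of (Treatment mean - Control mean), relative to a 50/50 split.

    Assumes equal variance in both variants and a fixed total sample size.
    """
    if not 0.0 < treatment_share < 1.0:
        raise ValueError("treatment_share must be strictly between 0 and 1")
    # Var(difference) is proportional to 1/p + 1/(1 - p) for a fixed total sample.
    variance_factor = 1.0 / treatment_share + 1.0 / (1.0 - treatment_share)
    # At p = 0.5 the factor equals 4, so divide by 4 to normalise to that baseline.
    return math.sqrt(variance_factor / 4.0)


for share in (0.01, 0.05, 0.10, 0.25, 0.50):
    ratio = relative_std_error(share)
    print(f"{share:>5.0%} Treatment -> {ratio:.2f}x the 50/50 standard error")
```

The ratio is smallest at a 50% share and grows as the split becomes more lopsided, which is why smaller ramps are better suited to risk mitigation than to measurement.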
Figure 15.1 Four phases of the ramping process
1. Create “rings” of testing populations and gradually expose the Treatment to successive rings to mitigate risk.
a) Whitelisted individuals, such as the team implementing the new feature. You can get verbatim feedback from your team members.
b) Company employees, as they are typically more forgiving if there are bad bugs.
c) Beta users or insiders who tend to be vocal and loyal, who want to see new features sooner, and who are usually willing to give feedback.
d) Data centers to isolate interactions that can be challenging to identify, such as memory leaks (death by slow leak) or other inappropriate use of resources.
2. Automatically dial up traffic until it reaches the desired allocation.
3. Produce real-time or near-real-time measurements on key guardrail metrics.
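The automatic dial-up and guardrail monitoring described in the list above could be sketched roughly as follows. Everything here is hypothetical: the step sizes in `RAMP_STEPS`, the `ramp` function, and the stand-in `fake_guardrail_check` are illustrative assumptions, not an API from the book or from any experimentation platform.

```python
from typing import Callable, Sequence


def ramp(
    allocations: Sequence[float],
    guardrails_healthy: Callable[[float], bool],
) -> float:
    """Step through increasing allocations and stop at the first guardrail failure.

    Returns the last allocation whose guardrail check passed (0.0 if none did).
    """
    last_safe = 0.0
    for allocation in allocations:
        # The callback stands for "expose this share of traffic, then read the
        # near-real-time guardrail metrics for it".
        if not guardrails_healthy(allocation):
            # Assumed policy: halt the ramp and fall back to the last safe step.
            break
        last_safe = allocation
    return last_safe


# Hypothetical ramp steps ending at a 50% allocation.
RAMP_STEPS = (0.01, 0.05, 0.10, 0.25, 0.50)


def fake_guardrail_check(allocation: float) -> bool:
    # Stand-in for real metrics: pretend a problem only surfaces above 10% traffic.
    return allocation <= 0.10


print(ramp(RAMP_STEPS, fake_guardrail_check))     # prints 0.1
print(ramp(RAMP_STEPS, lambda allocation: True))  # prints 0.5
```

A real system would also need a waiting period at each step so the guardrail metrics have time to accumulate; that is omitted here for brevity.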
MPR is the ramp phase dedicated to measuring the impact of the experiment.
This ramp phase must be long enough to capture time-dependent factors.
By the time an experiment is past the MPR phase, there should be no concerns regarding end-user impact. Optimally, operational concerns should also be resolved in earlier ramps.
We have seen increasing popularity in long-term holdouts, also called holdbacks, where certain users do not get exposed to Treatment for a long time.
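One common way to keep the same users out of Treatment over a long period is deterministic hashing of a user identifier. The sketch below is an assumption about how such a holdout might be implemented, not a method described in the highlight. The salt string, the 5% default, and the function name `in_holdout` are all hypothetical.

```python
import hashlib


def in_holdout(user_id: str, salt: str = "holdout-example", holdout_percent: int = 5) -> bool:
    """Deterministically place roughly holdout_percent% of users in the holdout."""
    # Hashing salt + user_id gives a stable, roughly uniform bucket in 0..99.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < holdout_percent


users = [f"user-{i}" for i in range(10_000)]
held_back = sum(in_holdout(user) for user in users)
print(f"{held_back} of {len(users)} users are held back from Treatment")
```

Because the assignment depends only on the salt and the user ID, a user stays in the holdout for as long as the salt is unchanged.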
Two-sample t-tests are the most common statistical significance tests for determining whether the difference we see between Treatment and Control is real or just noise (Student 1908; Wasserman 2004). Two-sample t-tests look at the size of the difference between the two means relative to the variance. The significance of the difference is represented by the p-value. The lower the p-value, the stronger the evidence that the Treatment is different from the Control.
Any difference with a p-value smaller than 0.05 is considered “statistically significant.”
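As a rough illustration of the test described above, the sketch below computes the unequal-variance (Welch) form of the two-sample t-statistic using only the Python standard library. Note one deliberate simplification: the p-value uses a normal approximation to the t-distribution, which is only reasonable for large samples. The function name `two_sample_test` and the sample data are made up for the example.

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_test(treatment, control):
    """Return (t_statistic, two_sided_p_value) for the difference in means.

    The p-value uses a normal approximation, so it assumes large samples.
    Both samples need at least two values and some variation.
    """
    mean_diff = mean(treatment) - mean(control)
    # Standard error of the difference: each sample variance scaled by its size.
    std_error = math.sqrt(
        variance(treatment) / len(treatment) + variance(control) / len(control)
    )
    t_stat = mean_diff / std_error
    # Two-sided p-value: probability of a statistic at least this extreme under no effect.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Made-up data: 200 Control observations and a Treatment shifted up by 0.5.
control = [1.0, 2.0, 3.0, 4.0] * 50
treatment = [value + 0.5 for value in control]

t_stat, p_value = two_sample_test(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, significant at 0.05: {p_value < 0.05}")
```

For small samples, a library routine that uses the exact t-distribution would be the safer choice.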

