Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing
Kindle Notes & Highlights
54%
If the only reason to run a controlled experiment is to measure, we could run the experiment at the maximum power ramp (MPR), which often means a 50% traffic allocation to the Treatment providing the highest statistical sensitivity, assuming our goal is to ramp that Treatment to 100%.
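To illustrate why a 50% allocation tends to give the highest sensitivity, here is a minimal sketch that compares the standard error of the Treatment-minus-Control difference across traffic splits. It assumes all traffic is divided between one Treatment and one Control and that both groups share the same per-user standard deviation; the function name `diff_std_error` and the sample size of 100,000 users are illustrative, not taken from the book.

```python
import math


def diff_std_error(total_users: int, treatment_share: float, sigma: float = 1.0) -> float:
    """Standard error of (Treatment mean - Control mean) for a given traffic split.

    Assumes equal per-user standard deviation `sigma` in both groups.
    """
    if not 0.0 < treatment_share < 1.0:
        raise ValueError("treatment_share must be strictly between 0 and 1")
    n_treatment = total_users * treatment_share
    n_control = total_users * (1.0 - treatment_share)
    return sigma * math.sqrt(1.0 / n_treatment + 1.0 / n_control)


# Hypothetical splits to compare; a smaller standard error means higher sensitivity.
shares = [0.05, 0.10, 0.25, 0.50, 0.75]
errors = {share: diff_std_error(100_000, share) for share in shares}
best_share = min(errors, key=errors.get)

for share, err in errors.items():
    print(f"{share:.0%} Treatment -> standard error {err:.5f}")
print(f"Most sensitive split among those tried: {best_share:.0%}")
```

Under these assumptions the standard error is proportional to sqrt(1/p + 1/(1-p)) for a Treatment share p, which is smallest at p = 0.5.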
54%
Figure 15.1 Four phases of the ramping process
54%
1. Create “rings” of testing populations and gradually expose the Treatment to successive rings to mitigate risk.
54%
a) Whitelisted individuals, such as the team implementing the new feature. You can get verbatim feedback from your team members.
b) Company employees, as they are typically more forgiving if there are bad bugs.
c) Beta users or insiders who tend to be vocal and loyal, who want to see new features sooner, and who are usually willing to give feedback.
d) Data centers to isolate interactions that can be challenging to identify, such as memory leaks (death by slow leak) or other inappropriate use of resources.
54%
2. Automatically dialing up traffic until it reaches the desired allocation.
55%
3. Producing real-time or near-real-time measurements on key guardrail metrics.
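The three practices above (rings, automatic dial-up, and guardrail monitoring) can be sketched together as a simple loop. This is only an illustration under stated assumptions: the stage labels, the traffic percentages, and the `guardrails_ok` callback are hypothetical and do not describe any particular experimentation platform.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical ramp plan: rings first, then growing traffic shares up to MPR.
STAGES: List[str] = [
    "ring: whitelisted individuals",
    "ring: company employees",
    "ring: beta users",
    "ring: one data center",
    "traffic: 5%",
    "traffic: 10%",
    "traffic: 25%",
    "traffic: 50% (MPR)",
]


def run_ramp(
    stages: List[str],
    guardrails_ok: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Expose the Treatment stage by stage, halting when a guardrail check fails.

    `guardrails_ok` stands in for near-real-time guardrail metrics measured at
    the given stage. Returns the stages that passed and the stage that failed
    (None if every stage passed).
    """
    completed: List[str] = []
    for stage in stages:
        if not guardrails_ok(stage):
            return completed, stage  # stop dialing up; the caller can roll back
        completed.append(stage)
    return completed, None


# Example: pretend a guardrail metric degrades once traffic reaches 25%.
passed, halted_at = run_ramp(STAGES, lambda stage: stage != "traffic: 25%")
print("Passed:", passed)
print("Halted at:", halted_at)
```

Keeping the guardrail check as a callback separates the ramp schedule from however the metrics are actually computed.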
55%
MPR is the ramp phase dedicated to measuring the impact of the experiment.
55%
This ramp phase must be long enough to capture time-dependent factors.
55%
By the time an experiment is past the MPR phase, there should be no concerns regarding end-user impact. Optimally, operational concerns should also be resolved in earlier ramps.
55%
We have seen increasing popularity in long-term holdouts, also called holdbacks, where certain users do not get exposed to Treatment for a long time.
58%
Two-sample t-tests are the most common statistical significance tests for determining whether the difference we see between Treatment and Control is real or just noise (Student 1908; Wasserman 2004). Two-sample t-tests look at the size of the difference between the two means relative to the variance. The significance of the difference is represented by the p-value. The lower the p-value, the stronger the evidence that the Treatment is different from the Control.
58%
any difference with a p-value smaller than 0.05 is considered “statistically significant,”
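As a rough illustration of the test described above, the sketch below computes a two-sample statistic (difference in means divided by its standard error, using unpooled variances) and a two-sided p-value. To stay within the Python standard library it uses a normal approximation to the t distribution, which is only reasonable for large samples; the function name `two_sample_test` and the synthetic data are assumptions made for this example.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def two_sample_test(treatment: Sequence[float], control: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, two-sided p-value) for Treatment vs. Control.

    Uses unpooled sample variances and a normal approximation for the
    p-value, so it is a large-sample sketch rather than an exact t-test.
    """
    delta = mean(treatment) - mean(control)
    std_error = math.sqrt(
        variance(treatment) / len(treatment) + variance(control) / len(control)
    )
    t_stat = delta / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Synthetic, deterministic data: Treatment is Control shifted up by 1.
control = [10.0 + (i % 5) for i in range(200)]
treatment = [value + 1.0 for value in control]

t_stat, p_value = two_sample_test(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
print("statistically significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

The statistic grows as the difference in means gets larger relative to the variance, which is what drives the p-value down.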