Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing
Kindle Notes & Highlights
Overall Evaluation Criterion (OEC): A quantitative measure of the experiment's objective.
The OEC must be measurable in the short term (the duration of an experiment) yet believed to causally drive long-term strategic objectives.
Parameter: A controllable experimental variable that is thought to influence the OEC or other metrics of interest.
Variant: A user experience being tested, typically by assigning values to parameters. In a simple A/B test, A and B are the two variants, usually called Control and Treatment.
Randomization Unit: A pseudo-randomization (e.g., hashing) process is applied to units (e.g., users or pages) to map them to variants.
It is very common, and we highly recommend, to use users as a randomization unit when running controlled experiments for online audiences. Some experimental designs choose to randomize by pages, sessions, or user-day.
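As an illustration of pseudo-randomization by hashing, here is a minimal sketch that maps user IDs to variants; the `assign_variant` helper, the MD5 hash, and the 1,000-bucket scheme are assumptions for illustration, not a design prescribed by the book:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("Control", "Treatment")) -> str:
    """Deterministically map a user to a variant by hashing user_id with the experiment_id."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 1000  # 1,000 fine-grained buckets
    # Equal split: buckets 0-499 -> Control, 500-999 -> Treatment
    return variants[0] if bucket < 500 else variants[1]

# A returning user always lands in the same variant of the same experiment
print(assign_variant("user-42", "checkout-redesign"))
```

Because the assignment is a deterministic function of the user ID and experiment ID, a returning user always sees the same experience, which is part of what makes users a convenient randomization unit.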
It is important to note that random does not mean “haphazard or unplanned,” but rather “a deliberate choice based on probabilities.”
Correlation does not imply causality, and overly relying on such observational correlations leads to faulty decisions.
Randomized controlled experiments are the gold standard for establishing causality. Systematic reviews, that is, meta-analyses, of controlled experiments provide more evidence and generalizability.
Figure 1.3: A simple hierarchy of evidence for assessing the quality of trial design (Greenhalgh 2014)
We believe online controlled experiments are:
The best scientific way to establish causality with high probability.
Able to detect small changes that are harder to detect with other techniques, such as changes over time (sensitivity).
Able to detect unexpected changes. Often underappreciated, but many experiments uncover surprising impacts on other metrics, be it performance degradation, increased crashes/errors, or cannibalizing clicks from other features.
1. There are experimental units (e.g., users) that can be assigned to different variants with no interference.
2. There are enough experimental units.
3. Key metrics, ideally an OEC, are agreed upon and can be practically evaluated.
4. Changes are easy to make.
The hard part is finding metrics that are measurable in a short period, sensitive enough to show differences, and predictive of long-term goals. For example, “Profit” is not a good OEC, as short-term theatrics (e.g., raising prices) can increase short-term profit but may hurt it in the long run. Customer lifetime value is a strategically powerful OEC.
Defining guardrail metrics for experiments is important for identifying what the organization is not willing to change, since a strategy also “requires you to make tradeoffs in competing – to choose what not to do.”
The ability to run controlled experiments allows you to significantly reduce uncertainty by trying a Minimum Viable Product (Ries 2011), getting data, and iterating. That said, not everyone may have a few years to invest in testing a new strategy, in which case you may need to make decisions in the face of uncertainty. One useful concept to keep in mind is EVI: Expected Value of Information from Douglas Hubbard (2014), which captures how additional information can help you in decision making.
Figure 2.1: A user online shopping funnel. Users may not progress linearly through a funnel, but instead skip, repeat, or go back and forth between steps.
To measure the impact of the change, we need to define goal metrics, or success metrics. When we have just one, we can use that metric directly as our OEC.
First, we characterize the metric by understanding the baseline mean value and the standard error of the mean, in other words, how variable the estimate of our metric will be.
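A minimal sketch of that characterization step, using made-up per-user revenue numbers (the metric name and values are illustrative, not from the book):

```python
import numpy as np

# Hypothetical historical per-user values of the metric (e.g., revenue per user)
revenue_per_user = np.array([0.0, 0.0, 3.5, 0.0, 12.0, 0.0, 1.2, 0.0, 0.0, 7.8])

baseline_mean = revenue_per_user.mean()
# Standard error of the mean = sample standard deviation / sqrt(sample size)
standard_error = revenue_per_user.std(ddof=1) / np.sqrt(len(revenue_per_user))

print(f"baseline mean = {baseline_mean:.2f}, standard error = {standard_error:.2f}")
```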
in controlled experiments, we have one sample for the Control and one sample for each Treatment.
If it is unlikely, we reject the Null hypothesis and claim that the difference is statistically significant.
We compute the p-value for the difference, which is the probability of observing such a difference, or a more extreme one, assuming the Null hypothesis is true. We reject the Null hypothesis and conclude that our experiment has an effect (or the result is statistically significant) if the p-value is small enough.
A 95% confidence interval is the range that covers the true difference 95% of the time, and for fairly large sample sizes it is usually centered around the observed delta between the Treatment and the Control with an extension of 1.96 standard errors on each side.
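Putting the last few highlights together, here is a hedged sketch that computes the delta, its two-sided p-value, and the 95% confidence interval for two simulated samples; the data and the normal approximation are assumptions for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated per-user metric values: one sample for Control, one for Treatment
control = rng.normal(loc=10.0, scale=4.0, size=5000)
treatment = rng.normal(loc=10.2, scale=4.0, size=5000)

delta = treatment.mean() - control.mean()
# Standard error of the difference between two independent sample means
se = np.sqrt(control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment))

# Two-sided p-value under the Null hypothesis of zero difference (normal approximation)
p_value = 2 * stats.norm.sf(abs(delta / se))

# 95% confidence interval: observed delta plus/minus 1.96 standard errors
ci = (delta - 1.96 * se, delta + 1.96 * se)

print(f"delta = {delta:.3f}, p-value = {p_value:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```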
Statistical power is the probability of detecting a meaningful difference between the variants when there really is one (statistically, reject the null when there is a difference).
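Power is normally set at the design stage by choosing a large enough sample. Below is a rough sketch of the standard per-variant sample-size calculation for a two-sided z-test; the 80%-power/5%-significance defaults and the example inputs are assumptions, not values from the book. With those defaults the formula reduces to roughly 16·σ²/δ² per variant.

```python
import math
from scipy import stats

def required_sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-sided z-test.

    n = 2 * (z_{1 - alpha/2} + z_{power})^2 * sigma^2 / delta^2
    With alpha = 0.05 and power = 0.80 this is roughly 16 * sigma^2 / delta^2.
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # ~1.96
    z_power = stats.norm.ppf(power)           # ~0.84
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)

# Hypothetical inputs: metric standard deviation 4.0, minimum difference worth detecting 0.2
print(required_sample_size(sigma=4.0, delta=0.2))  # roughly 6,300 users per variant
```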
While “statistical significance” measures how likely a result at least as extreme as the one you observe could have happened by chance assuming the Null hypothesis, not all statistically significant results are practically meaningful.
1. What is the randomization unit?
2. What population of randomization units do we want to target?
3. How large (size) does our experiment need to be?
4. How long do we run the experiment?
Targeting a specific population means that you only want to run the experiment for users with a particular characteristic.
More users: In online experiments, because users trickle into experiments over time, the longer the experiment runs, the more users the experiment gets.
Day-of-week effect: You may have a different population of users on weekends than weekdays.
Seasonality: There can be other times when users behave differently that are important to consider, such as holidays.
Primacy and novelty effects: There are experiments that tend to have a larger or smaller initial effect that takes time to stabilize.
To run an experiment, we need both:
Instrumentation to get log data on how users are interacting with your site and which experiments those interactions belong to (see Chapter 13).
Infrastructure to be able to run an experiment, ranging from experiment configuration to variant assignment. See Chapter 4, Experimentation Platform and Culture, for more detail.
There are many ways for bugs to creep in that would invalidate the experiment results. To catch them, we’ll look at the guardrail metrics or invariants.
There are two types of invariant metrics:
1. Trust-related guardrail metrics, such as expecting the Control and Treatment samples to be sized according to the configuration, or that they have the same cache-hit rates.
2. Organizational guardrail metrics, such as latency, which are important to the organization and expected to be an invariant for many experiments. In the checkout experiment, it would be very surprising if latency changed.
Twyman’s law, perhaps the most important single law in the whole of data analysis… The more unusual or interesting the data, the more likely they are to have been the result of an error of one kind or another.
Twyman’s Law: “Any figure that looks interesting or different is usually wrong”
Twyman’s Law: “Any statistic that appears interesting is almost certainly a mistake”
In experimentation, we can run tests that check for underlying issues, similar to asserts: if every user should see either Control or Treatment from a certain time, then having many users in both variants is a red flag; if the experiment design calls for equal percentages in the two variants, then large deviations that are probabilistically unlikely should likewise raise questions.
A common mistake is to assume that just because a metric is not statistically significant, there is no Treatment effect.
The p-value is the probability of obtaining a result equal to or more extreme than what was observed, assuming that the Null hypothesis is true. The conditioning on the Null hypothesis is critical.
When running an online controlled experiment, you could continuously monitor the p-values.
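The surrounding discussion treats this continuous monitoring ("peeking") as a pitfall: stopping as soon as any interim p-value dips below 0.05 inflates the false-positive rate well above 5%. A small A/A simulation, purely illustrative and not from the book, shows the effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_users, n_checks = 1000, 2000, 20
fp_final_only, fp_any_peek = 0, 0

for _ in range(n_experiments):
    # A/A experiment: Treatment is drawn from the same distribution as Control,
    # so every "statistically significant" result is a false positive.
    control = rng.normal(size=n_users)
    treatment = rng.normal(size=n_users)
    significant_at_some_peek = False
    for k in range(1, n_checks + 1):
        n = n_users * k // n_checks
        _, p = stats.ttest_ind(control[:n], treatment[:n])
        if p < 0.05:
            significant_at_some_peek = True
        if k == n_checks and p < 0.05:
            fp_final_only += 1
    if significant_at_some_peek:
        fp_any_peek += 1

print(f"false-positive rate, single test at the end: {fp_final_only / n_experiments:.3f}")
print(f"false-positive rate, stopping at any peek:   {fp_any_peek / n_experiments:.3f}")
```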
Confidence intervals, loosely speaking, quantify the degree of uncertainty in the Treatment effect. The confidence level represents how often the confidence interval should contain the true Treatment effect. There is a duality between p-values and confidence intervals.
A common mistake is to look at the confidence intervals separately for the Control and Treatment, and assume that if they overlap, the Treatment effect is not statistically different.
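A small numeric illustration of why this is a mistake, using made-up summary statistics: the two individual 95% intervals overlap, yet the interval for the difference excludes zero.

```python
import numpy as np
from scipy import stats

# Hypothetical summary statistics for the two variants
mean_c, se_c = 10.00, 0.10   # Control mean and standard error
mean_t, se_t = 10.35, 0.10   # Treatment mean and standard error

# The separate 95% confidence intervals overlap...
ci_control = (mean_c - 1.96 * se_c, mean_c + 1.96 * se_c)     # (9.804, 10.196)
ci_treatment = (mean_t - 1.96 * se_t, mean_t + 1.96 * se_t)   # (10.154, 10.546)

# ...but the interval for the *difference* uses the combined standard error,
# which is smaller than the sum of the two individual margins.
delta = mean_t - mean_c
se_delta = np.sqrt(se_c ** 2 + se_t ** 2)                     # ~0.141
ci_delta = (delta - 1.96 * se_delta, delta + 1.96 * se_delta)
p_value = 2 * stats.norm.sf(abs(delta / se_delta))

print(ci_control, ci_treatment)   # overlapping intervals
print(ci_delta, p_value)          # yet the difference excludes 0, p ~ 0.013
```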
Another common misunderstanding about confidence intervals is the belief that the presented 95% confidence interval has a 95% chance of containing the true Treatment effect.
In the analysis of controlled experiments, it is common to apply the Stable Unit Treatment Value Assumption (SUTVA) (Imbens and Rubin 2015), which states that experiment units (e.g., users) do not interfere with one another.
Analyzing users who have been active for some time (e.g., two months) introduces survivorship bias.
In some experiments, there is non-random attrition from the variants.
If the ratio of users (or any randomization unit) between the variants is not close to the designed ratio, the experiment suffers from a Sample Ratio Mismatch (SRM).
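A hedged sketch of an SRM check for an experiment designed as a 50/50 split, using a chi-squared goodness-of-fit test; the counts and the alert threshold are illustrative assumptions, not values prescribed by the book:

```python
from scipy import stats

# Hypothetical user counts per variant in an experiment designed as a 50/50 split
observed = [821_588, 815_482]
total = sum(observed)
expected = [total / 2, total / 2]

# Chi-squared goodness-of-fit test of the observed counts against the designed ratio
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"SRM check p-value = {p_value:.2e}")

# A tiny p-value (e.g., below 0.001) flags a Sample Ratio Mismatch: something in
# assignment or logging is likely broken, and the experiment results should not be
# trusted until the root cause is found.
```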
External validity refers to the extent to which the results of a controlled experiment can be generalized along axes such as different populations (e.g., other countries, other websites) and over time (e.g., will the 2% revenue increase continue for a long time or diminish?).