Kindle Notes & Highlights
by Ron Kohavi
Read between October 18 and November 11, 2021
Overall Evaluation Criterion (OEC): A quantitative measure of the experiment's objective.
The OEC must be measurable in the short term (the duration of an experiment) yet believed to causally drive long-term strategic objectives.
Parameter: A controllable experimental variable that is thought to influence the OEC or other metrics of interest.
Variant: A user experience being tested, typically by assigning values to parameters. In a simple A/B test, A and B are the two variants, usually called Control and Treatment.
Randomization Unit: A pseudo-randomization (e.g., hashing) process is applied to units (e.g., users or pages) to map them to variants.
It is very common, and we highly recommend, to use users as the randomization unit when running controlled experiments for online audiences. Some experimental designs choose to randomize by pages, sessions, or user-days.
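As a note on how the hash-based pseudo-randomization above might look in practice, here is a minimal sketch; the function name, experiment id, and 50/50 split are illustrative assumptions, not from the book:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str) -> str:
    """Deterministically map a user to a variant by hashing the
    user id together with the experiment id, so that each
    experiment gets an independent pseudo-random split."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 1000
    # 50/50 split: buckets 0-499 -> Control, 500-999 -> Treatment.
    return "Control" if bucket < 500 else "Treatment"

# The same user always gets the same variant within an experiment.
print(assign_variant("user-42", "checkout-coupon-exp"))
```

Hashing on experiment id plus user id keeps assignment stable for each user while decorrelating the splits of different experiments.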
It is important to note that random does not mean "haphazard or unplanned," but rather "a deliberate choice based on probabilities."
Correlation does not imply causality, and over-relying on such observations leads to faulty decisions.
Randomized controlled experiments are the gold standard for establishing causality. Systematic reviews, that is, meta-analyses, of controlled experiments provide even stronger evidence and generalizability.
Figure 1.3: A simple hierarchy of evidence for assessing the quality of trial design (Greenhalgh 2014).
We believe online controlled experiments are:
- The best scientific way to establish causality with high probability.
- Able to detect small changes that are harder to detect with other techniques, such as changes over time (sensitivity).
- Able to detect unexpected changes. Often underappreciated, but many experiments uncover surprising impacts on other metrics, be it performance degradation, increased crashes/errors, or cannibalizing clicks from other features.
1. There are experimental units (e.g., users) that can be assigned to different variants with no interference.
2. There are enough experimental units
3. Key metrics, ideally an OEC, are agreed upon and can be practically evaluated.
4. Changes are easy to make.
The hard part is finding metrics measurable in a short period, sensitive enough to show differences, and that are predictive of long-term goals. For example, “Profit” is not a good OEC, as short-term theatrics (e.g., raising prices) can increase short-term profit, but may hurt it in the long run. Customer lifetime value is a strategically powerful OEC.
Defining guardrail metrics for experiments is important for identifying what the organization is not willing to change, since a strategy also “requires you to make tradeoffs in competing – to choose what not to do”.
The ability to run controlled experiments allows you to significantly reduce uncertainty by trying a Minimum Viable Product (Ries 2011), getting data, and iterating. That said, not everyone may have a few years to invest in testing a new strategy, in which case you may need to make decisions in the face of uncertainty. One useful concept to keep in mind is EVI: Expected Value of Information from Douglas Hubbard (2014), which captures how additional information can help you in decision making.
Figure 2.1: A user online shopping funnel. Users may not progress linearly through a funnel, but instead skip, repeat, or go back and forth between steps.
To measure the impact of the change, we need to define goal metrics, or success metrics. When we have just one, we can use that metric directly as our OEC.
First, we characterize the metric by understanding the baseline mean value and the standard error of the mean, in other words, how variable the estimate of our metric will be.
In controlled experiments, we have one sample for the Control and one sample for each Treatment.
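A tiny sketch of that characterization step, assuming we already have per-user metric values in hand (the numbers below are made up):

```python
import numpy as np

# Hypothetical per-user values of the metric, e.g., revenue-per-user.
metric = np.array([0.0, 12.5, 0.0, 3.2, 0.0, 7.9, 0.0, 0.0, 25.0, 1.1])

mean = metric.mean()
# Standard error of the mean: sample standard deviation / sqrt(n).
sem = metric.std(ddof=1) / np.sqrt(len(metric))
print(f"baseline mean = {mean:.2f}, standard error of the mean = {sem:.2f}")
```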
If it is unlikely, we reject the Null hypothesis and claim that the difference is statistically significant.
We compute the p-value for the difference, which is the probability of observing such a difference, or one more extreme, assuming the Null hypothesis is true. We reject the Null hypothesis and conclude that our experiment has an effect (or that the result is statistically significant) if the p-value is small enough.
A 95% confidence interval is the range that covers the true difference 95% of the time, and for fairly large sample sizes it is usually centered around the observed delta between the Treatment and the Control with an extension of 1.96 standard errors on each side.
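A sketch of the computation these two highlights describe, using Welch's two-sample t-test on simulated per-user data; the sample sizes, means, and variances are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-user metric values, one sample per variant.
control = rng.normal(10.0, 4.0, size=5000)
treatment = rng.normal(10.2, 4.0, size=5000)

delta = treatment.mean() - control.mean()
# Standard error of the difference between two independent means.
se = np.sqrt(control.var(ddof=1) / len(control)
             + treatment.var(ddof=1) / len(treatment))

# Two-sided p-value under the Null hypothesis of zero difference
# (Welch's t-test, which does not assume equal variances).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# 95% confidence interval: observed delta +/- 1.96 standard errors.
lo, hi = delta - 1.96 * se, delta + 1.96 * se
print(f"delta={delta:.3f}  p-value={p_value:.4f}  95% CI=({lo:.3f}, {hi:.3f})")
```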
Statistical power is the probability of detecting a meaningful difference between the variants when there really is one (statistically, reject the null when there is a difference).
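A commonly used rule of thumb for sizing an experiment at roughly 80% power and a 5% significance level is n ≈ 16σ²/δ² users per variant; a sketch, with made-up numbers:

```python
import math

def sample_size_per_variant(sigma: float, delta: float) -> int:
    """Rule of thumb for ~80% power at a 5% significance level:
    n ~= 16 * sigma^2 / delta^2 users per variant, where sigma is
    the metric's standard deviation and delta is the smallest
    effect we want to be able to detect."""
    return math.ceil(16 * sigma**2 / delta**2)

# Example: detect a $0.10 change in a metric with standard deviation $4.
print(sample_size_per_variant(sigma=4.0, delta=0.10))  # -> 25600 per variant
```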
While “statistical significance” measures how likely it is that the observed result, or one more extreme, could have happened by chance assuming the Null hypothesis, not all statistically significant results are practically meaningful.
1. What is the randomization unit?
2. What population of randomization units do we want to target?
3. How large (size) does our experiment need to be?
4. How long do we run the experiment?
Targeting a specific population means that you only want to run the experiment for users with a particular characteristic.
More users: In online experiments, because users trickle into experiments over time, the longer the experiment runs, the more users the experiment gets.
Day-of-week effect: You may have a different population of users on weekends than weekdays.
Seasonality: There can be other times when users behave differently that are important to consider, such as holidays.
Primacy and novelty effects: There are experiments that tend to have a larger or smaller initial effect that takes time to stabilize.
To run an experiment, we need both:
- Instrumentation to get log data on how users are interacting with your site and which experiments those interactions belong to (see Chapter 13).
- Infrastructure to be able to run an experiment, ranging from experiment configuration to variant assignment (see Chapter 4, Experimentation Platform and Culture, for more detail).
There are many ways for bugs to creep in that would invalidate the experiment results. To catch them, we’ll look at the guardrail metrics or invariants.
There are two types of invariant metrics:
1. Trust-related guardrail metrics, such as expecting the Control and Treatment samples to be sized according to the configuration, or expecting them to have the same cache-hit rates.
2. Organizational guardrail metrics, such as latency, which are important to the organization and expected to be an invariant for many experiments. In the checkout experiment, it would be very surprising if latency changed.
Twyman’s law, perhaps the most important single law in the whole of data analysis… The more unusual or interesting the data, the more likely they are to have been the result of an error of one kind or another
Twyman’s Law: “Any figure that looks interesting or different is usually wrong”
Twyman’s Law: “Any statistic that appears interesting is almost certainly a mistake”
In experimentation, we can run tests that check for underlying issues, similar to asserts: if every user should see either Control or Treatment from a certain time, then having many users in both variants is a red flag; if the experiment design calls for equal percentages in the two variants, then large deviations that are probabilistically unlikely should likewise raise questions.
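One such assert, sketched in code: checking whether any user was logged in both variants (the ids and sets here are hypothetical):

```python
# Hypothetical assignment logs: the set of user ids seen in each variant.
control_users = {"u1", "u2", "u3", "u4"}
treatment_users = {"u4", "u5", "u6"}

# Assert-like check: once the experiment starts, no user should be
# exposed to both Control and Treatment.
overlap = control_users & treatment_users
if overlap:
    print(f"Red flag: {len(overlap)} user(s) in both variants: {sorted(overlap)}")
```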
A common mistake is to assume that just because a metric is not statistically significant, there is no Treatment effect.
The p-value is the probability of obtaining a result equal to or more extreme than what was observed, assuming that the Null hypothesis is true. The conditioning on the Null hypothesis is critical.
When running an online controlled experiment, you could continuously monitor the p-values; stopping the experiment as soon as the p-value first dips below 0.05, however, inflates the false-positive rate well beyond the nominal 5%.
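A toy simulation of why naive continuous monitoring is dangerous (all parameters below are arbitrary): on A/A data with no true effect, stopping at the first p < 0.05 across repeated peeks "wins" far more than 5% of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_users, peeks = 1000, 2000, 20
false_positives = 0

for _ in range(n_sims):
    # A/A data: both variants drawn from the same distribution,
    # so any "significant" result is a false positive.
    a = rng.normal(0, 1, n_users)
    b = rng.normal(0, 1, n_users)
    # Peek at the p-value at 20 interim points; stop at the first "win".
    for n in np.linspace(n_users // peeks, n_users, peeks, dtype=int):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.1%}")
# A single fixed-horizon test would be ~5%; peeking inflates it several-fold.
```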
Confidence intervals, loosely speaking, quantify the degree of uncertainty in the Treatment effect. The confidence level represents how often the confidence interval should contain the true Treatment effect. There is a duality between p-values and confidence intervals.
A common mistake is to look at the confidence intervals separately for the Control and Treatment, and assume that if they overlap, the Treatment effect is not statistically different.
Another common misunderstanding about confidence intervals is the belief that the presented 95% confidence interval has a 95% chance of containing the true Treatment effect.
In the analysis of controlled experiments, it is common to apply the Stable Unit Treatment Value Assumption (SUTVA) (Imbens and Rubin 2015), which states that experiment units (e.g., users) do not interfere with one another.
Analyzing users who have been active for some time (e.g., two months) introduces survivorship bias.
In some experiments, there is non-random attrition from the variants.
If the ratio of users (or any randomization unit) between the variants is not close to the designed ratio, the experiment suffers from a Sample Ratio Mismatch (SRM).
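A minimal SRM check, sketched as a chi-squared goodness-of-fit test against a designed 50/50 split (the counts and the 0.001 threshold are illustrative):

```python
from scipy import stats

# Observed user counts per variant vs. a designed 50/50 split.
observed = [50_800, 49_200]
expected = [sum(observed) / 2, sum(observed) / 2]

# Chi-squared goodness-of-fit test; a tiny p-value signals an SRM.
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"SRM check p-value = {p_value:.2e}")
if p_value < 0.001:  # illustrative threshold
    print("Sample Ratio Mismatch: distrust the results and debug first.")
```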
External validity refers to the extent to which the results of a controlled experiment can be generalized along axes such as different populations (e.g., other countries, other websites) and over time (e.g., will the 2% revenue increase continue for a long time or diminish?).

