Kindle Notes & Highlights
by Ron Kohavi
Read between October 18 - November 11, 2021
1. Crawl: The goal is building the foundational prerequisites, specifically instrumentation and basic data science capabilities, to compute the summary statistics needed for hypothesis testing so that you can design, run, and analyze a few experiments.
2. Walk: The goal shifts from prerequisites and running a few experiments to a focus on defining standard metrics and getting the organization to run more experiments.
3. Run: The goal shifts to running experiments at scale.
4. Fly: Now you are running A/B experiments as the norm for every change.
Engaging in the process of establishing shared goals and agreeing on the high-level goal metrics and guardrail metrics
Setting goals in terms of improvements to metrics instead of goals to ship features X and Y.
Empowering teams to innovate and improve key metrics within the organizational guardrails
Expecting proper instrumentation and high data quality.
Reviewing experiment results, knowing how to interpret them, enforcing standards on interpretation (e.g., to minimize p-hacking (Wikipedia contributors, Data dredging 2019)), and giving transparency to how those results affect decision making.
Ensuring a portfolio of high-risk/high-reward projects relative to more incremental-gain projects, understanding that some will work, and many—even most—will fail.
Supporting long-term learning from experiments,
For education, establishing just-in-time processes during experiment design and experiment analysis can really up-level an organization.
Compute many metrics, ensure that the important metrics, such as the OEC (Overall Evaluation Criterion), guardrail, and other related metrics, are highly visible on the experiment dashboard,
Send out newsletters or e-mails about surprising results (failures and successes), meta-analyses over many prior experiments to build intuition, how …
Make it hard for experimenters to launch a Treatment if it impacts important metrics negatively.
Embrace learning from failed ideas.
Experiment definition, setup, and management via a user interface (UI) or application programming interface (API), stored in the experiment system configuration
Experiment deployment, both server- and client-side, that covers variant assignment and parameterization
Experiment instrumentation
Experiment analysis, which includes definition and computation of metrics and statistical tests like p-values.
Figure 4.2 Possible experiment platform architecture.
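A minimal sketch of what an experiment definition stored in the experiment system configuration might contain, following the components listed above. The field names and the example values are illustrative assumptions, not a specific platform's schema.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentDefinition:
    """Hypothetical experiment definition as stored in the system configuration."""
    experiment_id: str
    owner: str
    layer_id: str                       # for overlapping/layered designs
    start_date: str
    end_date: str
    traffic_allocation: dict            # variant name -> fraction of traffic
    parameters: dict = field(default_factory=dict)   # variant -> parameter overrides
    metrics: list = field(default_factory=list)      # OEC, guardrail, and debug metrics

# Example definition an experimenter might create via the UI or API.
new_checkout = ExperimentDefinition(
    experiment_id="new-checkout-flow",
    owner="checkout-team",
    layer_id="ui-layer",
    start_date="2021-10-18",
    end_date="2021-11-01",
    traffic_allocation={"control": 0.5, "treatment": 0.5},
    parameters={"treatment": {"button_color": "green"}},
    metrics=["revenue-per-user", "page-load-time", "error-rate"],
)
```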
Deployment usually involves two components:
1. An experimentation infrastructure that provides experiment definitions, variant assignments, and other information
2. Production code changes that implement variant behavior according to the experiment assignment.
The experimentation infrastructure must provide:
Variant assignment: Given a user request and its attributes (e.g., country, language, OS, platform), which experiment and variant combinations is that request assigned to? This assignment is based on the experiment specification and a pseudo-random hash of an ID, that is, f(ID). In most cases, a user ID is used to ensure the assignment is consistent for a user. Variant assignment must also be independent, in that knowing the variant assignment of one user should not tell us anything about the variant assignment for a different user. We discuss this …
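A minimal sketch of hash-based variant assignment in this spirit: hash a user ID with an experiment identifier and map the result to a bucket. The function name, the salt scheme, and the bucket count are illustrative assumptions, not the book's exact implementation.

```python
import hashlib

NUM_BUCKETS = 1000  # granularity of traffic splits

def assign_variant(user_id: str, experiment_id: str, variants: dict) -> str:
    """Deterministically map a user to a variant, i.e., f(ID).

    `variants` maps variant name -> fraction of traffic, e.g.
    {"control": 0.5, "treatment": 0.5}. Hashing the user ID together with the
    experiment ID keeps assignment consistent for a user within an experiment
    while remaining independent across experiments.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % NUM_BUCKETS

    cumulative = 0.0
    for name, fraction in variants.items():
        cumulative += fraction
        if bucket < cumulative * NUM_BUCKETS:
            return name
    return "control"  # fall back if fractions don't sum to 1

# Example: the same user always lands in the same variant of this experiment.
print(assign_variant("user-42", "new-checkout-flow", {"control": 0.5, "treatment": 0.5}))
```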
A parameterized system is one where any possible change that you want to test in an experiment must be controlled by an experiment parameter.
As a system grows to have hundreds or thousands of parameters, even though any one experiment likely affects only a few of them, optimizing parameter handling, perhaps with caches, becomes critical from a performance perspective.
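A minimal sketch of cached parameter lookup, assuming parameter values are resolved per (user, parameter) from active experiment assignments. The names `resolve_from_assignments` and `DEFAULTS`, and the use of an LRU cache, are hypothetical illustrations of the caching idea, not the book's API.

```python
from functools import lru_cache

# Default values used when no experiment overrides a parameter.
DEFAULTS = {"button_color": "blue", "results_per_page": 10}

def resolve_from_assignments(user_id: str, parameter: str):
    """Placeholder: return the value this user's variant assignments set for
    `parameter`, or None if no active experiment overrides it."""
    return None  # stand-in; a real system would consult variant configs

@lru_cache(maxsize=100_000)
def get_parameter(user_id: str, parameter: str):
    """Return the parameter value for this user, falling back to the default.

    Caching the resolved value avoids re-scanning experiment configs on every
    request once the system has thousands of parameters.
    """
    value = resolve_from_assignments(user_id, parameter)
    return value if value is not None else DEFAULTS[parameter]

# Example: product code reads parameters instead of branching on experiments.
color = get_parameter("user-42", "button_color")
```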
Single-Layer Method: Variant assignment is the process by which users are consistently assigned to an experiment variant.
In concurrent (overlapping) experiments, each user can be in multiple experiments at the same time.
To ensure orthogonality of experiments across layers, add the layer ID to the hash used for assigning users to buckets, that is, f(ID, layer ID). This is also where you would add the layer ID, as in the experiment specification discussed above.
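A minimal sketch of bucket assignment in a layered (overlapping) design: including the layer ID in the hash keeps buckets in different layers orthogonal, so a user's bucket in one layer tells you nothing about their bucket in another. Constants and names are illustrative assumptions.

```python
import hashlib

NUM_BUCKETS = 1000

def bucket_for(user_id: str, layer_id: str) -> int:
    """Return the user's bucket within a layer: effectively f(ID, layer)."""
    key = f"{layer_id}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % NUM_BUCKETS

# The same user generally lands in unrelated buckets in different layers, so
# experiments in separate layers overlap yet remain statistically independent.
print(bucket_for("user-42", "ranking-layer"), bucket_for("user-42", "ui-layer"))
```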
One possibility is to extend a full factorial experiment design into a full factorial platform design.
The main drawback of this platform design is that it does not avoid potential collisions, where certain Treatments from two different experiments give users a poor experience if they coexist.
In a nested design, system parameters are partitioned into layers so that experiments that in combination may produce a poor user experience must be in the same layer and be prevented by design from running for the same user.
A constraints-based design has experimenters specify the constraints, and the system uses a graph-coloring algorithm to ensure that no two experiments that share a concern are exposed to the same user.
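A minimal sketch of the constraints-based idea using greedy graph coloring: experiments are nodes, an edge means two experiments share a concern and must not run for the same user, and colors correspond to non-overlapping groups (for example, separate layers or disjoint user buckets). This is an illustrative greedy algorithm, not the book's production implementation.

```python
def color_experiments(experiments, conflicts):
    """Assign each experiment the smallest 'color' not used by a conflicting one.

    `experiments`: list of experiment names.
    `conflicts`: set of frozensets, each pairing two conflicting experiments.
    """
    colors = {}
    for exp in experiments:
        used = {colors[other] for other in colors
                if frozenset({exp, other}) in conflicts}
        color = 0
        while color in used:
            color += 1
        colors[exp] = color
    return colors

# Example: the two checkout experiments touch the same UI surface, so they get
# different colors; the ranking experiment can share a color with either.
experiments = ["checkout-button", "checkout-copy", "ranking-model"]
conflicts = {frozenset({"checkout-button", "checkout-copy"})}
print(color_experiments(experiments, conflicts))
# e.g. {'checkout-button': 0, 'checkout-copy': 1, 'ranking-model': 0}
```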
Use visualization tools to generate per-metric views of all experiment results, which allows stakeholders to closely monitor the global health of key metrics and see which experiments are most impactful.
Visualization tools are a great gateway for accessing institutional memory to capture what was experimented on, why the decision was made, and the successes and failures that lead to knowledge discovery and learning.
Goal metrics, also called success metrics or true north metrics, reflect what the organization ultimately cares about.
Driver metrics, also called signpost metrics, surrogate metrics, or indirect or predictive metrics, tend to be shorter-term, faster-moving, and more sensitive than goal metrics.
Guardrail metrics guard against violated assumptions and come in two types: metrics that protect the business and metrics that assess the trustworthiness and internal validity of experiment results.
Asset vs. engagement metrics: Asset metrics measure the accumulation of static assets,
Business vs. operational metrics: Business metrics, such as revenue-per-user or daily active users (DAU), track the health of the business.
There are two types of guardrail metrics: trustworthiness-related guardrail metrics and organizational guardrail metrics.
Coming up with a single weighted combination may be hard initially, but you can start by classifying your decisions into four groups:
1. If all key metrics are flat (not statistically significant) or positive (statistically significant), with at least one metric positive, then ship the change.
2. If all key metrics are flat or negative, with at least one metric negative, then don’t ship the change.
3. If all key metrics are flat, then don’t ship the change and consider either increasing the experiment power, failing fast, or pivoting.
4. If some key metrics are positive and some key metrics …
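A minimal sketch of the four-way decision rule above, assuming each key metric's result is summarized as "positive", "negative", or "flat" (flat meaning not statistically significant). The fourth, mixed case is labeled a judgment call here because the highlight is truncated; the function and labels are illustrative, not the book's implementation.

```python
def ship_decision(metric_results: dict) -> str:
    """Map per-metric outcomes ('positive' / 'negative' / 'flat') to a launch decision."""
    outcomes = set(metric_results.values())
    if "negative" not in outcomes and "positive" in outcomes:
        return "ship"                       # case 1: nothing hurt, something helped
    if "positive" not in outcomes and "negative" in outcomes:
        return "don't ship"                 # case 2: nothing helped, something hurt
    if outcomes == {"flat"}:
        return "don't ship; increase power, fail fast, or pivot"  # case 3
    return "judgment call: weigh the trade-offs"                  # case 4: mixed results

print(ship_decision({"revenue-per-user": "positive", "latency": "flat"}))      # ship
print(ship_decision({"revenue-per-user": "positive", "latency": "negative"}))  # mixed
```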
Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.”
Campbell’s law, named after Donald Campbell, states that “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor”
Finding correlations in historical data does not imply that you can pick a point on a correlational curve by modifying one of the variables and expect the other to change.
Your company can effectively have a digital journal of all changes through experimentation, including descriptions, screenshots, and key results, with precious and rich data on each change (launched or not). This digital journal is what we refer to as Institutional Memory.
1. Client-side instrumentation can utilize significant CPU cycles and network bandwidth and deplete device batteries,
2. The JavaScript instrumentation can be lossy
d. Client clock can be changed, manually or automatically.
Your choice of metrics and your choice of randomization unit also interact.
Generally, we recommend that the randomization unit be the same as (or coarser than) the analysis unit in the metrics you care about.
A ramping process is used to control unknown risks associated with new feature launches.
To measure the impact and Return-On-Investment (ROI) of the Treatment if it launched to 100%.
To reduce risk by minimizing damage and cost to users and business during an experiment (i.e., when there is a negative impact).
To learn about users’ reactions, ideally by segments, to identify potential bugs, and to inform future plans.
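A minimal sketch of a staged ramp that serves these purposes: exposure increases stage by stage while guardrail metrics are checked, limiting damage if the Treatment regresses. The stages and the `guardrails_healthy` helper are hypothetical illustrations, not the book's prescribed schedule.

```python
RAMP_STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]  # fraction of traffic in Treatment

def guardrails_healthy(exposure: float) -> bool:
    """Placeholder: return True if guardrail metrics (latency, errors, revenue)
    show no statistically significant regression at this exposure level."""
    return True  # stand-in; a real system would query the analysis pipeline

def ramp(set_exposure) -> float:
    """Increase exposure stage by stage, rolling back if guardrails regress."""
    current = 0.0
    for stage in RAMP_STAGES:
        set_exposure(stage)
        if not guardrails_healthy(stage):
            set_exposure(current)   # roll back to the last healthy stage
            return current
        current = stage
    return current                  # reached 100%

final = ramp(lambda p: print(f"Treatment exposure set to {p:.0%}"))
```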

