Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing
1. Crawl: The goal is building the foundational prerequisites, specifically instrumentation and basic data science capabilities, to compute the summary statistics needed for hypothesis testing so that you can design, run, and analyze a few experiments.
2. Walk: The goal shifts from prerequisites and running a few experiments to a focus on defining standard metrics and getting the organization to run more experiments.
3. Run: The goal shifts to running experiments at scale.
4. Fly: Now you are running A/B experiments as the norm for every change.
Engaging in the process of establishing shared goals and agreeing on the high-level goal metrics and guardrail metrics
Setting goals in terms of improvements to metrics instead of goals to ship features X and Y.
Empowering teams to innovate and improve key metrics within the organizational guardrails
Expecting proper instrumentation and high data quality.
Reviewing experiment results, knowing how to interpret them, enforcing standards on interpretation (e.g., to minimize p-hacking (Wikipedia contributors, Data dredging 2019)), and giving transparency to how those results affect decision making.
Ensuring a portfolio of high-risk/high-reward projects relative to more incremental-gain projects, understanding that some will work, and many—even most—will fail.
Supporting long-term learning from experiments …
For education, establishing just-in-time processes during experiment design and experiment analysis can really up-level an organization.
Compute many metrics, ensure that the important metrics, such as the OEC, guardrail, and other related metrics, are highly visible on the experiment dashboard …
Send out newsletters or e-mails about surprising results (failures and successes), meta-analyses over many prior experiments to build intuition, how …
Make it hard for experimenters to launch a Treatment if it impacts important metrics negatively.
Embrace learning from failed ideas.
Experiment definition, setup, and management via a user interface (UI) or application programming interface (API), stored in the experiment system configuration
Experiment deployment, both server- and client-side, that covers variant assignment and parameterization
Experiment instrumentation
Experiment analysis, which includes definition and computation of metrics and statistical tests like p-values.
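To make the analysis component concrete, here is a minimal sketch of how a platform might compute a p-value for a conversion metric with a two-sample z-test; the function name and the numbers in the example are illustrative, not from the book.

```python
# Minimal sketch of the "experiment analysis" component: a two-sample z-test
# on a conversion metric. Counts below are made up for illustration.
from math import sqrt
from statistics import NormalDist

def proportion_ztest(conv_c, n_c, conv_t, n_t):
    """Two-sided z-test comparing Treatment vs. Control conversion rates."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))         # two-sided p-value
    return p_t - p_c, z, p_value

if __name__ == "__main__":
    delta, z, p = proportion_ztest(conv_c=1_000, n_c=50_000,
                                   conv_t=1_100, n_t=50_000)
    print(f"delta={delta:.4%}  z={z:.2f}  p-value={p:.4f}")
```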
Figure 4.2 Possible experiment platform architecture.
Deployment usually involves two components:
1. An experimentation infrastructure that provides experiment definitions, variant assignments, and other information
2. Production code changes that implement variant behavior according to the experiment assignment.
The experimentation infrastructure must provide:
Variant assignment: Given a user request and its attributes (e.g., country, language, OS, platform), which experiment and variant combinations is that request assigned to? This assignment is based on the experiment specification and a pseudo-random hash of an ID, that is, f(ID). In most cases, a user ID is used to ensure the assignment is consistent for a user. Variant assignment must also be independent: knowing the variant assignment of one user should not tell us anything about the variant assignment of a different user. We discuss this …
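A minimal sketch of f(ID), assuming a hash of the user ID salted with the experiment ID so that assignments are consistent per user and independent across experiments; the function name, salt scheme, and weights are illustrative.

```python
# Deterministic, pseudo-random variant assignment by hashing the user ID
# together with an experiment-specific salt.
import hashlib

def assign_variant(user_id: str, experiment_id: str, variants: list[str],
                   weights: list[float]) -> str:
    """Consistently map a user to a variant; same inputs always give the same output."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000            # uniform in [0, 1)
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variants[-1]                                    # guard against rounding

# Example: a 50/50 split; re-calling with the same user_id returns the same variant.
print(assign_variant("user-42", "exp-search-ranking", ["control", "treatment"], [0.5, 0.5]))
```

Salting the hash with the experiment ID keeps a user's assignment in one experiment uncorrelated with their assignment in another.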
A parameterized system is one where any possible change that you want to test in an experiment must be controlled by an experiment parameter.
As a system grows to have hundreds or thousands of parameters, even though any one experiment likely affects only a few of them, optimizing parameter handling, perhaps with caches, becomes critical from a performance perspective.
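A hypothetical sketch of what such parameter handling could look like, with resolved parameter sets cached so repeated lookups stay cheap; the store layout, names, and override scheme are assumptions for illustration, not the book's design.

```python
# Code reads every tunable value through get_param(); resolved parameter sets are
# cached per (user, experiment, variant) so thousands of lookups per request stay cheap.
from functools import lru_cache

DEFAULTS = {"button.color": "blue", "results.per_page": 10}

# Overrides a Treatment would apply, keyed by (experiment, variant); in a real
# system these would come from the experiment configuration service.
OVERRIDES = {("exp-button", "treatment"): {"button.color": "green"}}

@lru_cache(maxsize=100_000)
def resolved_params(user_id: str, experiment_id: str, variant: str) -> dict:
    params = dict(DEFAULTS)
    params.update(OVERRIDES.get((experiment_id, variant), {}))
    return params

def get_param(user_id: str, experiment_id: str, variant: str, name: str):
    return resolved_params(user_id, experiment_id, variant).get(name, DEFAULTS.get(name))

print(get_param("user-42", "exp-button", "treatment", "button.color"))  # -> "green"
```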
Single-Layer Method
Variant assignment is the process by which users are consistently assigned to an experiment variant.
In a concurrent (overlapping) experiment system, each user can be in multiple experiments at the same time.
To ensure orthogonality of experiments across layers, add the layer ID to the assignment of users to buckets, so that the hash is over both the ID and the layer ID, as in the experiment specification discussed above.
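A small sketch of this idea, assuming SHA-256 hashing and 1,000 buckets per layer; salting the hash with the layer ID keeps bucket assignments orthogonal across layers.

```python
# A user's bucket in one layer tells you nothing about their bucket in another,
# while assignment within a layer stays consistent for that user.
import hashlib

NUM_BUCKETS = 1_000

def bucket(user_id: str, layer_id: str) -> int:
    digest = hashlib.sha256(f"{layer_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

print(bucket("user-42", "ui-layer"), bucket("user-42", "ranking-layer"))
```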
One possibility is to extend a full factorial experiment design into a full factorial platform design.
The main drawback of this platform design is that it does not avoid potential collisions, where certain Treatments from two different experiments give users a poor experience if they coexist.
In a nested design, system parameters are partitioned into layers so that experiments that may produce a poor user experience in combination must be in the same layer, where they are prevented by design from running for the same user.
A constraints-based design has experimenters specify the constraints, and the system uses a graph-coloring algorithm to ensure that no two experiments that share a concern are exposed to the same user.
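One way to picture this is a greedy coloring of a conflict graph, where experiments are nodes, an edge marks a shared concern, and each color maps to a disjoint slice of users; this toy sketch only illustrates the idea and is not the algorithm any particular platform uses.

```python
# Greedy coloring of a conflict graph: experiments joined by an edge must never
# run on the same user, so they receive different colors (disjoint user groups).
def color_experiments(conflicts: dict[str, set[str]]) -> dict[str, int]:
    colors: dict[str, int] = {}
    for exp in sorted(conflicts):                      # deterministic order
        used = {colors[n] for n in conflicts[exp] if n in colors}
        color = 0
        while color in used:                           # smallest color not used by a neighbor
            color += 1
        colors[exp] = color
    return colors

conflicts = {
    "new-checkout": {"checkout-copy"},
    "checkout-copy": {"new-checkout"},
    "dark-mode": set(),                                # no constraints; can share traffic
}
print(color_experiments(conflicts))
# Experiments with different colors run on disjoint user groups, so no user ever
# sees two conflicting Treatments together.
```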
Use visualization tools to generate per-metric views of all experiment results, which allows stakeholders to closely monitor the global health of key metrics and see which experiments are most impactful.
Visualization tools are a great gateway for accessing institutional memory: capturing what was experimented on, why decisions were made, and the successes and failures that lead to knowledge discovery and learning.
Goal metrics, also called success metrics or true north metrics, reflect what the organization ultimately cares about.
Driver metrics, also called signpost metrics, surrogate metrics, or indirect or predictive metrics, tend to be shorter-term, faster-moving, and more sensitive metrics than goal metrics.
Guardrail metrics guard against violated assumptions and come in two types: metrics that protect the business and metrics that assess the trustworthiness and internal validity of experiment results.
Asset vs. engagement metrics: Asset metrics measure the accumulation of static assets …
Business vs. operational metrics: Business metrics, such as revenue-per-user or daily active users (DAU), track the health of the business.
There are two types of guardrail metrics: trustworthiness-related guardrail metrics and organizational guardrail metrics.
Coming up with a single weighted combination may be hard initially, but you can start by classifying your decisions into four groups:
1. If all key metrics are flat (not statistically significant) or positive (statistically significant), with at least one metric positive, then ship the change.
2. If all key metrics are flat or negative, with at least one metric negative, then don't ship the change.
3. If all key metrics are flat, then don't ship the change and consider either increasing the experiment power, failing fast, or pivoting.
4. If some key metrics are positive and some key metrics …
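A small sketch of these rules as code, assuming each key metric's result has already been classified as positive, negative, or flat; the fourth (mixed) case is truncated in the highlight, so the sketch only labels it rather than deciding it.

```python
# Classify a ship decision from per-metric results ("positive" / "negative" / "flat").
def ship_decision(results: dict[str, str]) -> str:
    values = set(results.values())
    if values <= {"flat"}:
        return "don't ship: all flat; increase power, fail fast, or pivot"
    if "negative" not in values:                 # flat/positive with at least one positive
        return "ship"
    if "positive" not in values:                 # flat/negative with at least one negative
        return "don't ship"
    return "mixed: some positive, some negative (see text)"

print(ship_decision({"revenue_per_user": "positive", "latency": "flat"}))   # -> ship
print(ship_decision({"revenue_per_user": "flat", "latency": "negative"}))   # -> don't ship
```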
Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.”
Campbell’s law, named after Donald Campbell, states that “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor”
Finding correlations in historical data does not imply that you can pick a point on a correlational curve by modifying one of the variables and expecting the other to change.
Your company can effectively have a digital journal of all changes made through experimentation, including descriptions, screenshots, and key results, with precious and rich data on each change (launched or not). This digital journal is what we refer to as Institutional Memory.
1. Client-side instrumentation can utilize significant CPU cycles and network bandwidth and deplete device batteries,
2. The JavaScript instrumentation can be lossy
d. Client clock can be changed, manually or automatically.
Your choice of metrics and your choice of randomization unit also interact.
Generally, we recommend that the randomization unit be the same as (or coarser than) the analysis unit in the metrics you care about.
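As an illustration (with made-up numbers), consider click-through rate when randomization is by user: pooling page views treats correlated observations from the same user as independent, whereas aggregating to the user first keeps the analysis unit at the randomization unit.

```python
# Contrast analysis units when randomization is by user. The data is invented
# purely to show the two computations.
users = [
    # (page_views, clicks) per user in one variant
    (20, 5), (3, 0), (1, 1), (50, 4), (2, 1),
]

# Page-view-level CTR: the analysis unit (page view) is finer than the
# randomization unit (user), so page views are not independent observations.
pooled_ctr = sum(c for _, c in users) / sum(v for v, _ in users)

# User-level metric: one value per randomization unit, so standard t-test
# variance formulas apply directly.
per_user_ctr = [c / v for v, c in users]
user_level_mean = sum(per_user_ctr) / len(per_user_ctr)

print(f"pooled CTR = {pooled_ctr:.3f}, mean of per-user CTR = {user_level_mean:.3f}")
# The two estimates differ, and their variances must be computed differently;
# for ratio-of-sums metrics the delta method or bootstrap is typically used.
```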
A ramping process is used to control unknown risks associated with new feature launches.
To measure the impact and Return-On-Investment (ROI) of the Treatment if it launched to 100%.
To reduce risk by minimizing damage and cost to users and business during an experiment (i.e., when there is a negative impact).
To learn about users’ reactions, ideally by segments, to identify potential bugs, and to inform future plans.