Michael Hayes’s Kindle Notes & Highlights for The Book of Why: The New Science of Cause and Effect (Penguin Science)

by Judea Pearl

34%

If we have data on a sufficient set of deconfounders, it does not matter if we ignore some or even all of the confounders.

34%

a method called the back-door criterion, which unambiguously identifies which variables in a causal diagram are deconfounders.

34%

In some cases we can control for confounding even when we do not have data on a sufficient set of deconfounders. In these cases we can use different adjustment formulas—not the conventional one, which is only appropriate for use with the back-door criterion—and still eradicate all confounding.
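For reference, the "conventional" adjustment formula mentioned here is usually written as below. This is the standard back-door adjustment, stated under the assumption that Z is a set of variables satisfying the back-door criterion for X and Y; the symbols x, y, z are generic values, not notation taken from the highlight.

```latex
P\bigl(Y = y \mid do(X = x)\bigr) \;=\; \sum_{z} P\bigl(Y = y \mid X = x,\, Z = z\bigr)\, P\bigl(Z = z\bigr)
```

In words: estimate the effect of X on Y separately within each stratum of Z, then average those estimates weighted by how common each stratum is in the population.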

34%

The classic forking pattern at the “Age” node tells us that age is a confounder of walking and mortality.

34%

Perhaps the casual walkers were slacking off for a reason; maybe they couldn’t walk as much. Thus, physical condition could be a confounder. We could go on and on like this. What if the light walkers were alcohol drinkers? What if they ate more?

34%

The study has accounted and adjusted for every reasonable factor—age, physical condition, alcohol consumption, diet, and several others.

34%

it’s true that the intense walkers tended to be slightly younger. So the researchers adjusted the death rate for age and found that the difference between casual and intense walkers was still very large.

35%

If we believe that Abbott’s team identified all the important confounders, we must also believe that intentional walking tends to prolong life (at least in Japanese males).

35%

knowing the set of assumptions that stand behind a given conclusion is not less valuable than attempting to circumvent those assumptions with an RCT, which, as we shall see, has complications of its own.

36%

Fisher realized that an uncertain answer to the right question is much better than a highly certain answer to the wrong question. If you ask the genie the wrong question, you will never find out what you want to know. If you ask the right question, getting an answer that is occasionally wrong is much less of a problem. You can still estimate the amount of uncertainty in your answer, because the uncertainty comes from the randomization procedure (which is known) rather than the characteristics of the soil (which are unknown).

36%

randomization actually brings two benefits. First, it eliminates confounder bias (it asks Nature the right question). Second, it enables the researcher to quantify his uncertainty.

36%

such causal estimates produced by observational studies may be labeled “provisional causality,” that is, causality contingent upon the set of assumptions that our causal diagram advertises.

37%

(i) X → Z → Y and (ii) X → M → Y, with M → Z. In example (i), Z satisfies conditions (1) and (2) but is not a confounder. It is known as a mediator: it is the variable that explains the causal effect of X on Y. It is a disaster to control for Z if you are trying to find the causal effect of X on Y. If you look only at those individuals in the treatment and control groups for whom Z = 0, then you have completely blocked the effect of X, because it works by changing Z. So you will conclude that X has no effect on Y. This is exactly what Ezra Klein meant when he said, “Sometimes you end up ...
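The mediator point in example (i) can be checked with a small exact calculation. The sketch below assumes a chain X → Z → Y with binary variables, X assigned by a fair coin (so there is no confounding), and made-up probabilities; none of the numbers come from the book.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical numbers for the chain X -> Z -> Y (not from the book).
P_X1 = F(1, 2)                              # X assigned by a fair coin
P_Z1_GIVEN_X = {0: F(2, 10), 1: F(9, 10)}   # X changes the mediator Z
P_Y1_GIVEN_Z = {0: F(1, 10), 1: F(7, 10)}   # Y depends on Z only


def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value == 1 else 1 - p


# Full joint distribution P(X, Z, Y) implied by the chain.
joint = {
    (x, z, y): bern(P_X1, x) * bern(P_Z1_GIVEN_X[x], z) * bern(P_Y1_GIVEN_Z[z], y)
    for x, z, y in product((0, 1), repeat=3)
}


def p_y1(x, z=None):
    """P(Y=1 | X=x), or P(Y=1 | X=x, Z=z) when z is given."""
    rows = {k: v for k, v in joint.items() if k[0] == x and (z is None or k[1] == z)}
    return sum(v for k, v in rows.items() if k[2] == 1) / sum(rows.values())


total_effect = p_y1(1) - p_y1(0)                  # no control for Z
effect_within_z0 = p_y1(1, z=0) - p_y1(0, z=0)    # looking only at Z = 0

print(total_effect)       # 21/50
print(effect_within_z0)   # 0
```

With these assumed numbers X raises the probability of Y by 0.42 overall, yet within the Z = 0 stratum the difference is exactly zero, which is the "X has no effect" conclusion the passage warns about.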

37%

In example (ii), Z is a proxy for the mediator M. Statisticians very often control for proxies when the actual causal variable can’t be measured; for instance, party affiliation might be used as a proxy for political beliefs.

37%

Because Z isn’t a perfect measure of M, some of the influence of X on Y might “leak through” if you control for Z. Nevertheless, controlling for Z is still a mistake. While the bias might be l...

This highlight has been truncated due to consecutive passage length restrictions.

38%

Ideally, each person would have a sticker on his forehead identifying which group he belonged to. Exchangeability simply means that the percentage of people with each kind of sticker (d percent, c percent, p percent, and i percent, respectively) should be the same in both the treatment and control groups.

38%

Equality among these proportions guarantees that the outcome would be just the same if we switched the treatments and controls. Otherwise, the treatment and control groups are not alike, and our estimate of the effect of the vaccine will be confounded.

38%

They can differ in age, sex, health conditions, and a variety of other characteristics. Only equality among d, c, p, and i determines whether they are exchangeable or not. So exchangeability amounts to equality between two sets of four proportions, a vast reduction in complexity from the altern...

38%

TABLE 4.1. Classification of individuals according ...

39%

Controlling for descendants (or proxies) of a variable is like “partially” controlling for the variable itself. Controlling for a descendant of a mediator partly closes the pipe; controlling for a descendant of a collider partly opens the pipe.

39%

if we have longer pipes with more junctions, like this: A ← B ← C → D ← E → F → G ← H → I → J? The answer is very simple: if a single junction is blocked, then J cannot “find out” about A through this path. So we have many options to block communication between A and J: control for B, control for C, don’t control for D (because it’s a collider), control for E, and so forth. Any one of these is sufficient. This is why the usual statistical procedure of controlling for everything that we can measure is so misguided. In fact, this particular path is blocked if we don’t control for anything!
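The junction rules for this long path can be sketched as a small checker. This is a simplified illustration, not code from the book: it looks at a single path written in the arrow notation above, and it ignores descendants of colliders that lie off the path (controlling for such a descendant would also partly open a collider). The function names are hypothetical.

```python
def parse_path(text):
    """Split 'A ← B → C' into node names and the arrows between them."""
    tokens = text.split()
    return tokens[0::2], tokens[1::2]


def is_blocked(path_text, controlled=frozenset()):
    """Return True if at least one junction on the path is blocked.

    A chain or fork junction is blocked when its middle node is controlled.
    A collider (→ node ←) is blocked when its middle node is NOT controlled.
    """
    nodes, arrows = parse_path(path_text)
    for i in range(1, len(nodes) - 1):
        left, right = arrows[i - 1], arrows[i]
        is_collider = left == "→" and right == "←"
        if is_collider and nodes[i] not in controlled:
            return True
        if not is_collider and nodes[i] in controlled:
            return True
    return False


PATH = "A ← B ← C → D ← E → F → G ← H → I → J"

print(is_blocked(PATH))                   # True: colliders D and G block it
print(is_blocked(PATH, {"D", "G"}))       # False: both colliders opened
print(is_blocked(PATH, {"D", "G", "E"}))  # True: the fork at E is now closed
```

The second call shows why controlling for everything measurable can backfire: conditioning on the two colliders is the only way to open a path that was blocked with no adjustment at all.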

39%

to deconfound two variables X and Y, we need only block every noncausal path between them without blocking or perturbing any causal paths. More precisely, a back-door path is any path from X to Y that starts with an arrow pointing into X. X and Y will be deconfounded if we block every back-door path (because such paths allow spurious correlation between X and Y). If we do this by controlling for some set of variables Z, we also need to make sure that no member of Z is a descendant of X on a causal path; otherwise we might partly or completely close off that path.

39%

GAME 1.

39%

There are no arrows leading into X, therefore no back-door paths. We don’t need to control for anything. Nevertheless, some researchers would consider B a confounder. It is associated with X because of the chain X → A → B. It is associated with Y among individuals with X = 0 because there is an open path B ← A → Y that does not pass through X. And B is not on the causal path X → A → Y. It therefore passes the three-step “classical epidemiological definition” for confounding, but it does not pass the back-door criterion and will lead to disaster if controlled for.

39%

GAME 2.

39%

variables. (The treatment, as usual, is X.) Now there is one back-door path X ← A → B ← D → E → Y. This path is already blocked by the collider at B, so we don’t need to control for anything.

39%

Many statisticians would control for B or C, thinking there is no harm in doing so as long as they occur before the treatment. A leading statistician even recently wrote, “To avoid conditioning on some observed covariates … is nonscientific ad hockery.” He is wrong; conditioning on B or C is a poor idea because it would open the noncausal path and therefore confound X and Y.

39%

in this case we could reclose the path by controlling for A or D. This example shows that there may be different strategies for deconfounding. One researcher might take the easy way and not control for anything; a more traditional researcher might control for C and D. Both would be correct and should get the sa...

39%

GA...

39%

There is one back-door path from X to Y, X ← B → Y, which can only be blocked by controlling for B. If B is unobservable, then there is no way of estimating the effect of X on Y without running a randomized controlled experiment.

39%

Some (in fact, most) statisticians in this situation would control for A, as a proxy for the unobservable variable B, but this only partially eliminates the confounding bias and introduces a new collider bias.

39%

GA...

39%

called “M-bias” (named for the shape of the graph). Once again there is only one back-door path, and it is already blocked by a collider at B. So we don’t need to control for anything. Nevertheless, all statisticians before 1986 and many today would consider B a confounder.

No front door path?

39%

X and Y are unconfounded if we do not control for B. B only becomes a confounder when you control for it!
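A quick simulation can illustrate M-bias. The sketch below assumes the M-shaped graph A → X, A → B ← C, C → Y with no arrow from X to Y, linear relationships, and standard normal noise; all of that is chosen for illustration and is not taken from the book. Partial correlation stands in for "controlling for B" in this linear setting.

```python
import random
from math import sqrt


def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)


def partial_corr(u, v, w):
    """Correlation of u and v after linearly adjusting both for w."""
    r_uv, r_uw, r_vw = corr(u, v), corr(u, w), corr(v, w)
    return (r_uv - r_uw * r_vw) / sqrt((1 - r_uw ** 2) * (1 - r_vw ** 2))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

# Assumed structural equations for the M-graph (X has NO effect on Y).
A = [rng.gauss(0, 1) for _ in range(n)]
C = [rng.gauss(0, 1) for _ in range(n)]
X = [a + rng.gauss(0, 1) for a in A]
Y = [c + rng.gauss(0, 1) for c in C]
B = [a + c + rng.gauss(0, 1) for a, c in zip(A, C)]  # the collider

raw = corr(X, Y)                  # near zero: the path is blocked at B
adjusted = partial_corr(X, Y, B)  # clearly negative: controlling for B opens it

print(f"raw correlation of X and Y:      {raw:+.3f}")
print(f"after controlling for B:         {adjusted:+.3f}")
```

Under these assumptions the adjusted correlation should sit near -0.2 even though X does nothing to Y, which is the sense in which B "only becomes a confounder when you control for it".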

39%

seat-belt usage (B) has no causal effect on smoking (X) or lung disease (Y); it is merely an indicator of a person’s attitudes toward societal norms (A) as well as safety and health-related measures (C). Some of these attitudes may affect susceptibility to lung disease (Y). In practice, seat-belt usage was found to be correlated with both X and Y; indeed, in a study conducted in 2006 as part of a tobacco litigation, seat-belt usage was listed as one of the first variables to be controlled for. If you accept the above model, then controlling for B alone would be a mistake.

39%

GAME 5.

40%

Game 5 is just Game 4 with a little extra wrinkle. Now a second back-door path X ← B ← C → Y needs to be closed. If we close this path by controlling for B, then we open up the M-shaped path X ← A → B ← C → Y. To close that path, we must control for A or C as well. However, notice that we could just control for C alone; that would close the path X ← B ← C → Y and not affect the other path.
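The Game 5 reasoning can be checked mechanically. The sketch below assumes a graph reconstructed from the two back-door paths named in this highlight plus a causal arrow X → Y; the book's figure may contain other arrows. It uses the usual textbook form of the back-door criterion (no member of Z may be a descendant of X), and the function names are hypothetical.

```python
# Assumed Game 5 graph as (parent, child) pairs; the figure itself is not
# in these notes, so this edge list is a reconstruction.
EDGES = {("A", "X"), ("A", "B"), ("C", "B"), ("C", "Y"), ("B", "X"), ("X", "Y")}


def descendants(node):
    """All nodes reachable from `node` by following arrows forward."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in EDGES:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def simple_paths(start, goal, path=None):
    """Yield every path from start to goal, ignoring arrow direction."""
    path = path or [start]
    if start == goal:
        yield path
        return
    for a, b in EDGES:
        if start in (a, b):
            nxt = b if a == start else a
            if nxt not in path:
                yield from simple_paths(nxt, goal, path + [nxt])


def is_blocked(path, z):
    """True if some junction on the path is blocked given the set z."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in EDGES and (nxt, node) in EDGES
        if collider:
            # A collider blocks unless it or one of its descendants is in z.
            if node not in z and not (descendants(node) & z):
                return True
        elif node in z:
            return True
    return False


def satisfies_backdoor(z, x="X", y="Y"):
    """Check the back-door criterion for adjustment set z."""
    z = set(z)
    if z & descendants(x):
        return False
    backdoor = [p for p in simple_paths(x, y) if (p[1], x) in EDGES]
    return all(is_blocked(p, z) for p in backdoor)


for candidate in [set(), {"B"}, {"A", "B"}, {"B", "C"}, {"C"}]:
    print(sorted(candidate), satisfies_backdoor(candidate))
```

On this assumed graph the empty set and {B} fail, while {A, B}, {B, C}, and {C} alone all pass, matching the options described in the highlight.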

40%

In Game 1, A represents an underlying abnormality that is induced by smoking; this is not an observable variable because we don’t know what the abnormality is. B represents a history of previous miscarriages.

40%

Game 2 is a more complicated version where there are two different smoking variables: X represents whether the mother smokes now (at the beginning of the second pregnancy), while A represents whether she smoked during the first pregnancy. B and E are underlying abnormalities caused by smoking, which are unobservable, and D represents other physiological causes of those abnormalities.

40%

In Game 4, X represents an individual’s smoking behavior, and Y represents whether the person has asthma as an adult. B represents childhood asthma, which is a collider because it is affected by both A, parental smoking, and C, an underlying (and unobservable) predisposition toward asthma. In Game 5 the variables have the same meanings, but they added two arrows for greater realism. (Game 4 was only meant to introduce the M-graph.)

40%

the full model in their paper has a few more

40%

Forbes and Williamson found that smoking had a small and statistically insignificant association with adult asthma in the raw data, and the effect became even smaller and more insignificant after adjusting for the confounders.

41%

a fortunate one-to-one relation between cause and effect. The cholera bacillus is the only cause of cholera; or as we would say today, it is both necessary and sufficient. If you aren’t exposed to it, you won’t get the disease. Likewise, a vitamin C deficiency is necessary to produce scurvy, and given enough time, it is also sufficient.

41%

The smoking-cancer debate challenged this monolithic concept of causation. Many people smoke their whole lives and never get lung cancer. Conversely, some people get lung cancer without ever lighting up a cigarette. Some people may get it because of a hereditary disposition, others because of exposure to carcinogens, and some for both reasons.

41%

The US surgeon general’s report, in 1964, stated in no uncertain terms, “Cigarette smoking is causally related to lung cancer in men.” This blunt statement forever shut down the argument that smoking was “not proven” to cause cancer. The rate of smoking in the United States among men began to decrease the following year and is now less than half what it was in 1964. No doubt millions of lives have been saved and lifespans lengthened.

41%

the surgeon general’s committee relied on an informal series of guidelines, called Hill’s criteria, named for University of London statistician Austin Bradford Hill. Every one of these criteria has demonstrable exceptions, although collectively they have a compelling commonsense value and even wisdom.

41%

Before cigarettes, lung cancer had been so rare that a doctor might encounter it only once in a lifetime of practice. But between 1900 and 1950, the formerly rare disease quadrupled in frequency, and by 1960 it would become the most common form of cancer among men. Such a huge change in the incidence of a lethal disease begged for an explanation. FIGURE 5.1.

42%

case-control study because it compares “cases” (people with a disease) to controls.

42%

researchers can control for confounders like age, sex, and exposure to environmental pollutants. Nevertheless, the case-control design has some obvious drawbacks. It is retrospective; that means we study people known to have cancer and look backward to discover why. The probability logic is backward too. The data tell us the probability that a cancer patient is a smoker instead of the probability that a smoker will get cancer. It is the latter probability that really matters to a person who wants to know whether he should smoke or not.
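The "backward probability" point is Bayes' rule in action. The sketch below uses invented numbers purely for illustration (they are not from the book or from Doll and Hill's data) to show how far P(smoker | cancer) can be from P(cancer | smoker). Note that a case-control study on its own supplies only the first quantity; the two base rates have to come from elsewhere.

```python
from fractions import Fraction as F

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
p_smoker_given_cancer = F(9, 10)   # what a case-control study can estimate
p_cancer = F(1, 1000)              # base rate of the disease (assumed)
p_smoker = F(1, 2)                 # share of smokers in the population (assumed)

# Bayes' rule: P(cancer | smoker) = P(smoker | cancer) * P(cancer) / P(smoker)
p_cancer_given_smoker = p_smoker_given_cancer * p_cancer / p_smoker

print(p_smoker_given_cancer)   # 9/10
print(p_cancer_given_smoker)   # 9/5000
```

With these made-up inputs, 90 percent of cancer patients are smokers, yet a smoker's probability of having the disease is 0.18 percent; the two numbers answer different questions.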

42%

case-control studies admit several possible sources of bias. One of them is called recall bias: although Doll and Hill ensured that the interviewers didn’t know the diagnoses, the patients certainly knew whether they had cancer or not. This could have affected their recollections. Another problem is selection bias. Hospitalized cancer patients w...
