The Book of Why: The New Science of Cause and Effect
Read between May 21 and June 27, 2018
20%
Unlike correlation and most of the other tools of mainstream statistics, causal analysis requires the user to make a subjective commitment. She must draw a causal diagram that reflects her qualitative belief—or, better yet, the consensus belief of researchers in her field of expertise—about the topology of the causal processes at work. She must abandon the centuries-old dogma of objectivity for objectivity’s sake. Where causation is concerned, a grain of wise subjectivity tells us more about the real world than any amount of objectivity.
20%
what frequentists could not abide was that Bayesians were allowing opinion, in the form of subjective probabilities, to intrude into the pristine kingdom of statistics.
20%
in many cases it can be proven that the influence of prior beliefs vanishes as the size of the data increases, leaving a single objective conclusion in the end. Unfortunately, the acceptance of Bayesian subjectivity in mainstream statistics did nothing to help the acceptance of causal subjectivity, the kind needed to specify a path diagram. Why? The answer rests on a grand linguistic barrier. To articulate subjective assumptions, Bayesian statisticians still use the language of probability, the native language of Galton and Pearson. The assumptions entering causal inference, on the other hand, …
20%
the subjective component in causal information does not necessarily diminish over time, even as the amount of data increases. Two people who believe in two different causal diagrams can analyze the same data and may never come to the same conclusion, regardless of how “big” the data are. This is a terrifying prospect for advocates of scientific objectivity, which explains their refusal to accept the inevitability of relying on subjective causal information. On the positive side, causal inference is objective in one critically important sense: once two people agree on their assumptions, it …
20%
“It’s elementary, my dear Watson.” So spoke Sherlock Holmes (at least in the movies) just before dazzling his faithful assistant with one of his famously nonelementary deductions. But in fact, Holmes performed not just deduction, which works from a hypothesis to a conclusion. His great skill was induction, which works in the opposite direction, from evidence to hypothesis. Another of his famous quotes suggests his modus operandi: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.” Having induced several hypotheses, Holmes eliminated them one by …
21%
Bayesian networks, the machine-reasoning tool that underlies the Bonaparte software, affect our lives in many ways that most people are not aware of. They are used in speech-recognition software, in spam filters, in weather forecasting, in the evaluation of potential oil wells, and in the Food and Drug Administration’s approval process for medical devices.
21%
They are related to causal diagrams in a simple way: a causal diagram is a Bayesian network in which every arrow signifies a direct causal relation, or at least the possibility of one, in the direction of that arrow. Not all Bayesian networks are causal, and in many applications it does not matter. However, if you ever want to ask a rung-two or rung-three query about your Bayesian network, you must draw it with scrupulous attention to causality.
21%
Thomas Bayes, after whom I named the networks in 1985, never dreamed that a formula he derived in the 1750s would one day be used to identify disaster victims. He was concerned only with the probabilities of two events, one (the hypothesis) occurring before the other (the evidence). Nevertheless, causality was very much on his mind. In fact, causal aspirations were the driving force behind his analysis of “inverse probability.”
23%
However, the story would be very different if our patient had a gene that put her at high risk for breast cancer—say, a one-in-twenty chance within the next year. Then a positive test would increase the probability to almost one in three.
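The jump from one in twenty to one in three is just Bayes’s rule at work. A minimal sketch, assuming a 90% true-positive rate and a 10% false-positive rate for the test (illustrative numbers chosen only to reproduce the “almost one in three” figure, not the book’s own):

```python
# Bayes's rule: P(cancer | positive) =
#     P(positive | cancer) * P(cancer) / P(positive)
prior = 1 / 20           # high-risk patient: one-in-twenty chance
sensitivity = 0.90       # assumed P(positive | cancer)
false_positive = 0.10    # assumed P(positive | no cancer)

p_positive = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / p_positive
print(posterior)         # ~0.32 -- almost one in three
```

Running the same arithmetic with a much smaller prior shows why a positive test alone, without the high-risk gene, is far less alarming than it sounds.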
24%
The approach was fine in theory, but hard-and-fast rules can rarely capture real-life knowledge. Perhaps without realizing it, we deal with exceptions to rules and uncertainties in evidence all the time. By 1980, it was clear that expert systems struggled with making correct inferences from uncertain knowledge. The computer could not replicate the inferential process of a human expert because the experts themselves were not able to articulate their thinking process within the language provided by the system.
24%
Unfortunately, although ingenious, these approaches suffered a common flaw: they modeled the expert, not the world, and therefore tended to produce unintended results.
24%
The key point is that all the neurons are passing information back and forth, from the top down and from the bottom up and from side to side.
24%
I finally realized that the messages were conditional probabilities in one direction and likelihood ratios in the other.
24%
Applying these two rules repeatedly to every node in the network is called belief propagation.
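To make the two kinds of messages concrete, here is a sketch of the diagnostic (likelihood) direction on a three-node chain A → B → C with evidence observed at C; the network and its conditional probability tables are invented for illustration. The causal messages would flow in the opposite direction, as conditional probabilities rather than likelihoods.

```python
import numpy as np

# Belief propagation sketch on the chain A -> B -> C (illustrative CPTs).
P_A = np.array([0.6, 0.4])            # prior over A
P_B_given_A = np.array([[0.9, 0.1],   # rows: values of A, cols: values of B
                        [0.3, 0.7]])
P_C_given_B = np.array([[0.8, 0.2],
                        [0.25, 0.75]])

# Evidence C = 1. Likelihood messages flow upward, against the arrows.
lam_C = np.array([0.0, 1.0])          # evidence vector at C
lam_B = P_C_given_B @ lam_C           # P(C=1 | each value of B)
lam_A = P_B_given_A @ lam_B           # P(C=1 | each value of A)

belief_A = P_A * lam_A
belief_A /= belief_A.sum()            # updated belief at A given C = 1
print(belief_A)                       # ~[0.40, 0.60]
```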
25%
In my public lectures I often call them “gifts from the gods” because they enable us to test a causal model, discover new models, evaluate effects of interventions, and much more.
25%
That key, which we will learn about in Chapter 7, involves all three junctions, and is called d-separation. This concept tells us, for any given pattern of paths in the model, what patterns of dependencies we should expect in the data. This fundamental connection between causes and probabilities constitutes the main contribution of Bayesian networks to the science of causal inference.
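One can watch d-separation at work on the three elementary junctions with a few lines of code. A sketch using networkx (whose d_separated helper appears in version 2.8; later releases rename it is_d_separator):

```python
import networkx as nx

chain = nx.DiGraph([("A", "B"), ("B", "C")])     # A -> B -> C
fork = nx.DiGraph([("B", "A"), ("B", "C")])      # A <- B -> C
collider = nx.DiGraph([("A", "B"), ("C", "B")])  # A -> B <- C

for name, g in [("chain", chain), ("fork", fork), ("collider", collider)]:
    marginal = nx.d_separated(g, {"A"}, {"C"}, set())  # is A independent of C?
    given_B = nx.d_separated(g, {"A"}, {"C"}, {"B"})   # ... given B?
    print(f"{name}: A _||_ C: {marginal}, A _||_ C | B: {given_B}")
```

The chain and the fork are dependent marginally but independent given B; the collider is the mirror image, becoming dependent once B is conditioned on.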
25%
So far I have emphasized only one aspect of Bayesian networks—namely, the diagram and its arrows that preferably point from cause to effect. Indeed, the diagram is like the engine of the Bayesian network. But like any engine, a Bayesian network runs on fuel. The fuel is called a conditional probability table.
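Concretely, each node’s fuel tank is one row of probabilities per combination of its parents’ values. A minimal sketch for a node B with a single parent A (made-up numbers):

```python
# Conditional probability table P(B | A): one distribution over B per value of A.
cpt_B = {
    "A=0": {"B=0": 0.9, "B=1": 0.1},
    "A=1": {"B=0": 0.3, "B=1": 0.7},
}
# Each row is a full probability distribution, so it must sum to one.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in cpt_B.values())
```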
27%
The transparency of Bayesian networks distinguishes them from most other approaches to machine learning, which tend to produce inscrutable “black boxes.” In a Bayesian network you can follow every step and understand how and why each piece of evidence changed the network’s beliefs.
27%
By any measure, turbo codes have been a staggering success. Before the turbo revolution, 2G cell phones used “soft decoding” (i.e., probabilities) but not belief propagation. 3G cell phones used Berrou’s turbo codes, and 4G phones used Gallager’s turbo-like codes. From the consumer’s viewpoint, this means that your cell phone uses less energy and the battery lasts longer, because coding and decoding are your cell phone’s most energy-intensive processes. Also, better codes mean that you do not have to be as close to a cell tower to get high-quality transmission. In other words, Bayesian …
27%
All the probabilistic properties of Bayesian networks (including the junctions we discussed earlier in this chapter) and the belief propagation algorithms that were developed for them remain valid in causal diagrams. They are in fact indispensable for understanding causal inference.
28%
If, however, the same diagram has been constructed as a causal diagram, then both the thinking that goes into the construction and the interpretation of the final diagram change. In the construction phase, we need to examine each variable, say C, and ask ourselves which other variables it “listens” to before choosing its value. The chain structure A → B → C means that B listens to A only, C listens to B only, and A listens to no one; that is, it is determined by external forces that are not part of our model.
28%
This has two enormously important implications. First, it tells us that causal assumptions cannot be invented at our whim; they are subject to the scrutiny of data and can be falsified. For instance, if the observed data do not show A and C to be independent, conditional on B, then we can safely conclude that the chain model is incompatible with the data and needs to be discarded (or repaired). Second, the graphical properties of the diagram dictate which causal models can be distinguished by data and which will forever remain indistinguishable, no matter how large the data. For example, we …
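That scrutiny can be carried out directly on data. A sketch that simulates the chain A → B → C (with arbitrary coefficients) and checks the independence the model predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)   # B listens to A only
C = 0.5 * B + rng.normal(size=n)   # C listens to B only

def partial_corr(x, y, z):
    # Correlate the residuals of x and y after regressing each on z.
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

print(np.corrcoef(A, C)[0, 1])     # clearly nonzero: A and C are dependent
print(partial_corr(A, C, B))       # ~0: independent given B, as the chain demands
```

If the second number came out far from zero in real data, the chain model would be falsified.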
28%
An arrow from A to C means that if we could wiggle only A, then we would expect to see a change in the probability of C. A missing arrow from A to C means that in the same experiment we would not see any change in C, once we held constant the parents of C (in other words, B in the example above). Note that the probabilistic expression “once we know the value of B” has given way to the causal expression “once we hold B constant,” which implies that we are physically preventing B from varying and disabling the arrow from A to B. The causal thinking that goes into the construction of the causal network will …
28%
Now we come to the second, and perhaps more important, impact of Bayesian networks on causal inference. The relationships that were discovered between the graphical structure of the diagram and the data that it represents now permit us to emulate wiggling without physically doing so. Specifically, applying a smart sequence of conditioning operations enables us to predict the effect of actions or interventions without actually conducting an experiment.
28%
This ability to emulate interventions by smart observations could not have been acquired had the statistical properties of Bayesian networks not been unveiled between 1980 and 1988. We can now decide which set of variables we must measure in order to predict the effects of interventions from observational studies.
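The simplest of those smart conditioning sequences is back-door adjustment: average P(Y | X, Z) over the confounder Z’s own distribution instead of its distribution given X. A sketch on an invented model Z → X, Z → Y, X → Y, with made-up probabilities:

```python
# Back-door adjustment versus plain conditioning (all numbers illustrative).
P_Z = {0: 0.7, 1: 0.3}              # P(Z = z)
P_X1_given_Z = {0: 0.2, 1: 0.8}     # P(X = 1 | Z = z)
P_Y1_given_XZ = {(0, 0): 0.1, (1, 0): 0.3,   # P(Y = 1 | X = x, Z = z)
                 (0, 1): 0.5, (1, 1): 0.7}

def p_x_given_z(x, z):
    return P_X1_given_Z[z] if x == 1 else 1 - P_X1_given_Z[z]

def p_y1_do(x):
    # do(X = x): Z keeps its own distribution.
    return sum(P_Y1_given_XZ[(x, z)] * P_Z[z] for z in P_Z)

def p_y1_see(x):
    # see(X = x): observing x changes our beliefs about Z.
    p_x = sum(p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    return sum(P_Y1_given_XZ[(x, z)] * p_x_given_z(x, z) * P_Z[z] / p_x
               for z in P_Z)

print(p_y1_do(1), p_y1_see(1))      # 0.42 vs. ~0.55: doing and seeing differ
```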
28%
Before we turn to the new science of cause and effect—illuminated by causal models—we should first try to understand the strengths and limitations of the old, model-blind science: why randomization is needed to conclude that A causes B and the nature of the threat (called “confounding”) that RCTs are intended to disarm.
29%
After we examine these issues in the light of causal diagrams, we can place randomized controlled trials into their proper context. Either we can view them as a special case of our inference engine, or we can view causal inference as a vast extension of RCTs. Either viewpoint is fine, and perhaps people trained to see RCTs as the arbiter of causation will find the latter more congenial.
31%
randomization actually brings two benefits. First, it eliminates confounder bias (it asks Nature the right question). Second, it enables the researcher to quantify his uncertainty.
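The first benefit is easy to demonstrate by simulation: randomizing the treatment cuts the arrow from the confounder into the treatment, so a plain difference in means becomes unbiased. The data-generating numbers below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
U = rng.normal(size=n)                        # unobserved confounder

# Observational regime: sicker patients (high U) get treated more often.
X_obs = (U + rng.normal(size=n) > 0).astype(float)
Y_obs = 1.0 * X_obs + 2.0 * U + rng.normal(size=n)   # true effect is 1.0

# Randomized regime: treatment no longer listens to U.
X_rct = rng.binomial(1, 0.5, size=n).astype(float)
Y_rct = 1.0 * X_rct + 2.0 * U + rng.normal(size=n)

print(Y_obs[X_obs == 1].mean() - Y_obs[X_obs == 0].mean())  # badly biased upward
print(Y_rct[X_rct == 1].mean() - Y_rct[X_rct == 0].mean())  # ~1.0
```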
39%
Their recognition that nonstatistical criteria were necessary was a great step forward.
39%
Viewed from the perspective of causality, the report was at best a modest success. It clearly established the gravity of causal questions and that data alone could not answer them.
49%
I imagine this is how Hipparchus of Nicaea felt when he discovered he could figure out the height of a pyramid from its shadow on the ground, without actually climbing the pyramid. It was a clear victory of mind over matter.
49%
Now let us return to our central question of when a model can replace an experiment, or when a “do” quantity can be reduced to a “see” quantity.
50%
Note that each rule has a simple syntactic interpretation. Rule 1 permits the addition or deletion of observations. Rule 2 permits the replacement of an intervention with an observation, or vice versa. Rule 3 permits the deletion or addition of interventions. All of these permits are issued under appropriate conditions, which have to be verified in any particular case from the causal diagram.
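For reference, here are the three rules in their standard published notation (this is the usual textbook statement, not a quotation from the book): G with a bar over X is the diagram with all arrows into X deleted, G with an underbar on Z deletes all arrows out of Z, and Z(W) is the set of Z-nodes that are not ancestors of any W-node once the arrows into X are deleted.

```latex
% Rule 1: insertion/deletion of observations
P(y \mid \mathrm{do}(x), z, w) = P(y \mid \mathrm{do}(x), w)
  \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}}}

% Rule 2: exchange of action and observation
P(y \mid \mathrm{do}(x), \mathrm{do}(z), w) = P(y \mid \mathrm{do}(x), z, w)
  \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\underline{Z}}}

% Rule 3: insertion/deletion of actions
P(y \mid \mathrm{do}(x), \mathrm{do}(z), w) = P(y \mid \mathrm{do}(x), w)
  \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\,\overline{Z(W)}}}
```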
50%
Both teams were, however, recognized with best student paper awards at the Uncertainty in Artificial Intelligence conference in 2006.
50%
Before declaring total victory, we should discuss one issue with the do-calculus. Like any other calculus, it enables the construction of a proof, but it does not help us find one. It is an excellent verifier of a solution but not such a good searcher for one. If you know the correct sequence of transformations, it is easy to demonstrate to others (who are familiar with Rules 1 to 3) that the do-operator can be eliminated. However, if you do not know the correct sequence, it is not easy to discover it, or even to determine whether one exists.
50%
The number of possibilities is limitless, and the axioms themselves provide no guidance about what to try next. My high school geometry teacher used to say that you need “mathematical eyeglasses.” In mathematical logic, this is known as the “decision problem.” Many logical systems are plagued with intractable decision problems.
50%
Shpitser’s algorithm for finding each and every causal effect does not eliminate the need for the do-calculus. In fact, we need it even more, and for several independent reasons. First, we need it in order to go beyond observational studies. Suppose that worst comes to worst, and our causal model does not permit estimation of the causal effect P(Y | do(X)) from observations alone. Perhaps we also cannot conduct a randomized experiment with random assignment of X. A clever researcher might ask whether we might estimate P(Y | do(X)) by randomizing some other variable, say Z, that is more …
51%
Even more problems of this sort arise when we consider problems of transportability or external validity—assessing whether an experimental result will still be valid when transported to a different environment that may differ in several key ways from the one studied. This more ambitious set of questions touches on the heart of scientific methodology, for there is no science without generalization.
51%
In 2015, Bareinboim and I presented a paper at the National Academy of Sciences that solves the problem, provided that you can express your assumptions about both environments with a causal diagram.
51%
Yet another reason that the do-calculus remains important is transparency. As I wrote this chapter, Bareinboim (now a professor at Purdue) sent me a new puzzle: a diagram with just four observed variables, X, Y, Z, and W, and two unobservable variables, U1, U2 (see Figure 7.5). He challenged me to figure out if the effect of X on Y was estimable. There was no way to block the back-door paths and no front-door condition. I tried all my favorite shortcuts and my otherwise trustworthy intuitive arguments, both pro and con, and I couldn’t see how to do it. I could not find a way out of the maze. …
51%
Around 2005, Wermuth and Cox became interested in a problem called “sequential decisions” or “time-varying treatments,” which are common, for example, in the treatment of AIDS. Typically treatments are administered over a length of time, and in each time period physicians vary the strength and dosage of a follow-up treatment according to the patient’s condition. The patient’s condition, on the other hand, is influenced by the treatments taken in the past.
51%
It so happens that a single application of the three rules of do-calculus can accomplish this. The moral of the story is nothing but a deep appreciation of the power of mathematics to solve difficult problems, which occasionally entail practical consequences.
51%
He was too talented for any of his high school math teachers to keep him interested. What he eventually accomplished was truly amazing. Verma finally proved what became known as the d-separation property (i.e., the fact that you can use the rules of path blocking to determine which independencies should hold in the data). Astonishingly, he told me that he proved the d-separation property thinking it was a homework problem, not an unsolved conjecture! Sometimes it pays to be young and naive. You can still see his legacy in Rule 1 of the do-calculus and in any imprint that path blocking leaves …
51%
Four years later, in April 2001, he stunned the world with a simple graphical criterion that generalizes the front door, the back door, and all doors we could think of at the time. I recall presenting Tian’s criterion at a Santa Fe conference. One by one, leaders in the research community stared at my poster and shook their heads in disbelief. How could such a simple criterion work for all diagrams? Tian (now a professor at Iowa State University) came to our lab with a style of thinking that was foreign to us then, in the 1990s. Our conversations were always loaded with wild metaphors and …
52%
Tian’s method, called c-decomposition, enabled Ilya Shpitser to develop his complete algorithm for the do-calculus. The moral: never underestimate the power of a locker-room conversation! Ilya Shpitser came in at the end of the ten-year battle to understand interventions.
52%
Completeness proofs are notoriously difficult and are best avoided by any student who aims to finish his PhD on time.
52%
At a lecture of his in Uppsala, Sweden, I first learned that performing interventions could be thought of as deleting arrows from a causal diagram.
52%
Until then I had been laboring under the same burden as generations of statisticians, trying to think of causality in terms of only one diagram representing one static probability distribution.
52%
Trygve Haavelmo (a Norwegian economist and Nobel laureate) in 1943 advocated equation modification to represent interventions. Nevertheless, Spirtes’s translation of equation deletion into the world of causal diagrams unleashed an avalanche of new insights and new results. The back-door criterion was one of the first beneficiaries of the translation, while the do-calculus came second. The avalanche, however, is not yet over.
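In graph terms, Spirtes’s arrow-deletion idea is a one-line operation: to represent do(X), delete every arrow pointing into X and leave the rest of the diagram intact. A sketch using networkx:

```python
import networkx as nx

# Observational diagram: Z confounds X and Y, and X causes Y.
G = nx.DiGraph([("Z", "X"), ("Z", "Y"), ("X", "Y")])

def do(graph, node):
    # "Graph surgery": the intervened variable stops listening to its parents.
    cut = graph.copy()
    cut.remove_edges_from(list(graph.in_edges(node)))
    return cut

G_do_X = do(G, "X")
print(sorted(G.edges()))       # [('X', 'Y'), ('Z', 'X'), ('Z', 'Y')]
print(sorted(G_do_X.edges()))  # [('X', 'Y'), ('Z', 'Y')]: Z -> X is gone
```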
52%
A variable that satisfies these three properties is today called an instrumental variable. Clearly Snow thought of this variable as similar to a coin flip, which simulates a variable with no incoming arrows.
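Those three properties license the classic instrumental-variable (Wald) estimate: the ratio of the instrument’s covariance with the outcome to its covariance with the treatment. A simulation sketch with arbitrary coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
U = rng.normal(size=n)                        # unobserved confounder
Z = rng.binomial(1, 0.5, size=n)              # instrument: a literal coin flip
X = 1.0 * Z + 1.5 * U + rng.normal(size=n)    # treatment
Y = 2.0 * X + 3.0 * U + rng.normal(size=n)    # true effect of X on Y is 2.0

naive = np.cov(X, Y)[0, 1] / np.cov(X, Y)[0, 0]   # confounded: far from 2.0
wald = np.cov(Z, Y)[0, 1] / np.cov(Z, X)[0, 1]    # ~2.0
print(naive, wald)
```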