Kindle Notes & Highlights
by Judea Pearl
Read between May 21 – June 27, 2018
Unlike correlation and most of the other tools of mainstream statistics, causal analysis requires the user to make a subjective commitment. She must draw a causal diagram that reflects her qualitative belief—or, better yet, the consensus belief of researchers in her field of expertise—about the topology of the causal processes at work. She must abandon the centuries-old dogma of objectivity for objectivity’s sake. Where causation is concerned, a grain of wise subjectivity tells us more about the real world than any amount of objectivity.
what frequentists could not abide was that Bayesians were allowing opinion, in the form of subjective probabilities, to intrude into the pristine kingdom of statistics.
in many cases it can be proven that the influence of prior beliefs vanishes as the size of the data increases, leaving a single objective conclusion in the end. Unfortunately, the acceptance of Bayesian subjectivity in mainstream statistics did nothing to help the acceptance of causal subjectivity, the kind needed to specify a path diagram. Why? The answer rests on a grand linguistic barrier. To articulate subjective assumptions, Bayesian statisticians still use the language of probability, the native language of Galton and Pearson. The assumptions entering causal inference, on the other hand,
the subjective component in causal information does not necessarily diminish over time, even as the amount of data increases. Two people who believe in two different causal diagrams can analyze the same data and may never come to the same conclusion, regardless of how “big” the data are. This is a terrifying prospect for advocates of scientific objectivity, which explains their refusal to accept the inevitability of relying on subjective causal information. On the positive side, causal inference is objective in one critically important sense: once two people agree on their assumptions, it
“It’s elementary, my dear Watson.” So spoke Sherlock Holmes (at least in the movies) just before dazzling his faithful assistant with one of his famously nonelementary deductions. But in fact, Holmes performed not just deduction, which works from a hypothesis to a conclusion. His great skill was induction, which works in the opposite direction, from evidence to hypothesis. Another of his famous quotes suggests his modus operandi: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.” Having induced several hypotheses, Holmes eliminated them one by
Bayesian networks, the machine-reasoning tool that underlies the Bonaparte software, affect our lives in many ways that most people are not aware of. They are used in speech-recognition software, in spam filters, in weather forecasting, in the evaluation of potential oil wells, and in the Food and Drug Administration’s approval process for medical devices.
They are related to causal diagrams in a simple way: a causal diagram is a Bayesian network in which every arrow signifies a direct causal relation, or at least the possibility of one, in the direction of that arrow. Not all Bayesian networks are causal, and in many applications it does not matter. However, if you ever want to ask a rung-two or rung-three query about your Bayesian network, you must draw it with scrupulous attention to causality.
Thomas Bayes, after whom I named the networks in 1985, never dreamed that a formula he derived in the 1750s would one day be used to identify disaster victims. He was concerned only with the probabilities of two events, one (the hypothesis) occurring before the other (the evidence). Nevertheless, causality was very much on his mind. In fact, causal aspirations were the driving force behind his analysis of “inverse probability.”
However, the story would be very different if our patient had a gene that put her at high risk for breast cancer—say, a one-in-twenty chance within the next year. Then a positive test would increase the probability to almost one in three.
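A rough sketch of the Bayes-rule arithmetic behind this highlight. The one-in-twenty prior comes from the passage; the test's sensitivity and false-positive rate below are illustrative assumptions, chosen only to show how a positive result lifts the probability to roughly one in three.

```python
# Bayes-rule arithmetic for the high-risk patient (illustrative numbers).
prior = 1 / 20          # P(cancer): the one-in-twenty prior from the passage
sensitivity = 0.73      # assumed P(positive | cancer)
false_positive = 0.073  # assumed P(positive | no cancer)

p_positive = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / p_positive            # Bayes' rule
print(f"P(cancer | positive test) = {posterior:.2f}")   # ~0.34, about 1 in 3
```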
The approach was fine in theory, but hard-and-fast rules can rarely capture real-life knowledge. Perhaps without realizing it, we deal with exceptions to rules and uncertainties in evidence all the time. By 1980, it was clear that expert systems struggled with making correct inferences from uncertain knowledge. The computer could not replicate the inferential process of a human expert because the experts themselves were not able to articulate their thinking process within the language provided by the system.
Unfortunately, although ingenious, these approaches suffered a common flaw: they modeled the expert, not the world, and therefore tended to produce unintended results.
The key point is that all the neurons are passing information back and forth, from the top down and from the bottom up and from side to side.
I finally realized that the messages were conditional probabilities in one direction and likelihood ratios in the other.
Applying these two rules repeatedly to every node in the network is called belief propagation.
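A minimal sketch of the idea on a two-link chain A → B → C, with made-up conditional probability tables: the π (probability) messages flow downward from the prior, the λ (likelihood) messages flow upward from the evidence, and each node's belief is the normalized product of the two streams.

```python
# Belief propagation on a chain A -> B -> C (made-up tables).
# "pi" messages carry predictive probabilities downward; "lambda" messages
# carry unnormalized likelihoods of the evidence upward (equivalent, up to
# scale, to likelihood ratios).
import numpy as np

P_A = np.array([0.7, 0.3])                 # prior P(A)
P_B_given_A = np.array([[0.9, 0.1],        # rows: A = 0, 1; columns: B = 0, 1
                        [0.2, 0.8]])
P_C_given_B = np.array([[0.8, 0.2],        # rows: B = 0, 1; columns: C = 0, 1
                        [0.3, 0.7]])

lam_C = np.array([0.0, 1.0])               # evidence: C is observed to be 1

# Upward pass: how likely is the evidence under each parent state?
lam_B = P_C_given_B @ lam_C                # lambda(b) = sum_c P(c | b) * lam(c)
lam_A = P_B_given_A @ lam_B

# Downward pass: predictive support arriving from above.
pi_A = P_A
pi_B = pi_A @ P_B_given_A                  # pi(b) = sum_a P(b | a) * pi(a)

# Belief = product of the two streams, normalized.
bel_A = pi_A * lam_A; bel_A /= bel_A.sum()
bel_B = pi_B * lam_B; bel_B /= bel_B.sum()
print("P(A | C=1) =", bel_A)
print("P(B | C=1) =", bel_B)
```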
In my public lectures I often call them “gifts from the gods” because they enable us to test a causal model, discover new models, evaluate effects of interventions, and much more.
That key, which we will learn about in Chapter 7, involves all three junctions, and is called d-separation. This concept tells us, for any given pattern of paths in the model, what patterns of dependencies we should expect in the data. This fundamental connection between causes and probabilities constitutes the main contribution of Bayesian networks to the science of causal inference.
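A small sketch of how a d-separation query can be checked mechanically, using the standard ancestral-moral-graph criterion. This is an illustrative implementation (assuming the networkx library and disjoint variable sets), not code from the book; the three test graphs are the three junctions.

```python
import networkx as nx

def d_separated(G: nx.DiGraph, xs, ys, zs):
    """True if every path between xs and ys in DAG G is blocked by zs.
    Method: keep only ancestors of the variables involved, 'marry' parents
    that share a child, drop arrow directions, delete the conditioning set,
    then test whether xs and ys are still connected."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    relevant = set()
    for v in xs | ys | zs:
        relevant |= nx.ancestors(G, v) | {v}
    H = G.subgraph(relevant)
    M = nx.Graph()
    M.add_nodes_from(H.nodes())
    M.add_edges_from(H.edges())
    for child in H.nodes():
        parents = list(H.predecessors(child))
        for i in range(len(parents)):
            for j in range(i + 1, len(parents)):
                M.add_edge(parents[i], parents[j])   # moralization
    M.remove_nodes_from(zs)
    return all(not nx.has_path(M, x, y) for x in xs for y in ys)

chain    = nx.DiGraph([("A", "B"), ("B", "C")])   # A -> B -> C
fork     = nx.DiGraph([("B", "A"), ("B", "C")])   # A <- B -> C
collider = nx.DiGraph([("A", "B"), ("C", "B")])   # A -> B <- C

print(d_separated(chain,    {"A"}, {"C"}, {"B"}))   # True: B blocks the chain
print(d_separated(fork,     {"A"}, {"C"}, {"B"}))   # True: the common cause blocks the fork
print(d_separated(collider, {"A"}, {"C"}, set()))   # True: a collider blocks by default
print(d_separated(collider, {"A"}, {"C"}, {"B"}))   # False: conditioning on the collider opens it
```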
So far I have emphasized only one aspect of Bayesian networks—namely, the diagram and its arrows that preferably point from cause to effect. Indeed, the diagram is like the engine of the Bayesian network. But like any engine, a Bayesian network runs on fuel. The fuel is called a conditional probability table.
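As a concrete picture of that fuel, here is a toy conditional probability table for a node with two parents; the variable names and numbers are the standard textbook-style illustration, not an example taken from this book.

```python
# A conditional probability table (CPT): for every combination of parent
# values, the probability of the child variable.  Illustrative numbers only.
cpt_alarm = {
    # (burglary, earthquake): P(alarm)
    (True,  True):  0.95,
    (True,  False): 0.94,
    (False, True):  0.29,
    (False, False): 0.001,
}
print(cpt_alarm[(False, True)])   # P(alarm | no burglary, earthquake) = 0.29
```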
The transparency of Bayesian networks distinguishes them from most other approaches to machine learning, which tend to produce inscrutable “black boxes.” In a Bayesian network you can follow every step and understand how and why each piece of evidence changed the network’s beliefs.
By any measure, turbo codes have been a staggering success. Before the turbo revolution, 2G cell phones used “soft decoding” (i.e., probabilities) but not belief propagation. 3G cell phones used Berrou’s turbo codes, and 4G phones used Gallager’s turbo-like codes. From the consumer’s viewpoint, this means that your cell phone uses less energy and the battery lasts longer, because coding and decoding are your cell phone’s most energy-intensive processes. Also, better codes mean that you do not have to be as close to a cell tower to get high-quality transmission. In other words, Bayesian
All the probabilistic properties of Bayesian networks (including the junctions we discussed earlier in this chapter) and the belief propagation algorithms that were developed for them remain valid in causal diagrams. They are in fact indispensable for understanding causal inference.
If, however, the same diagram has been constructed as a causal diagram, then both the thinking that goes into the construction and the interpretation of the final diagram change. In the construction phase, we need to examine each variable, say C, and ask ourselves which other variables it “listens” to before choosing its value. The chain structure A → B → C means that B listens to A only, C listens to B only, and A listens to no one; that is, it is determined by external forces that are not part of our model.
This has two enormously important implications. First, it tells us that causal assumptions cannot be invented at our whim; they are subject to the scrutiny of data and can be falsified. For instance, if the observed data do not show A and C to be independent, conditional on B, then we can safely conclude that the chain model is incompatible with the data and needs to be discarded (or repaired). Second, the graphical properties of the diagram dictate which causal models can be distinguished by data and which will forever remain indistinguishable, no matter how large the data. For example, we
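A small simulation of this testable implication, under an assumed chain model with made-up "listening" functions: once we look within a fixed stratum of B, A carries no further information about C, exactly as the chain A → B → C predicts.

```python
# Testable implication of the chain A -> B -> C:  A is independent of C given B.
# Each variable listens only to its parent (plus its own coin flips); all
# probabilities are made up for illustration.
import random
random.seed(0)

def sample():
    a = random.random() < 0.5                                        # A listens to no one
    b = (random.random() < 0.8) if a else (random.random() < 0.2)    # B listens to A only
    c = (random.random() < 0.7) if b else (random.random() < 0.3)    # C listens to B only
    return a, b, c

data = [sample() for _ in range(200_000)]

for b_val in (False, True):
    stratum = [(a, c) for a, b, c in data if b == b_val]
    p_c_a1 = sum(c for a, c in stratum if a) / sum(1 for a, _ in stratum if a)
    p_c_a0 = sum(c for a, c in stratum if not a) / sum(1 for a, _ in stratum if not a)
    print(f"B={b_val}:  P(C | A=1, B) = {p_c_a1:.3f}   P(C | A=0, B) = {p_c_a0:.3f}")
# Within each stratum of B the two numbers agree up to sampling noise.
# If real data refused to show this independence, the chain model would
# have to be discarded or repaired.
```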
An arrow from A to C means that if we could wiggle only A, then we would expect to see a change in the probability of C. A missing arrow from A to C means that in the same experiment we would not see any change in C, once we held constant the parents of C (in other words, B in the example above). Note that the probabilistic expression “once we know the value of B” has given way to the causal expression “once we hold B constant,” which implies that we are physically preventing B from varying and disabling the arrow from A to B. The causal thinking that goes into the construction of the causal network will
Now we come to the second, and perhaps more important, impact of Bayesian networks on causal inference. The relationships that were discovered between the graphical structure of the diagram and the data that it represents now permit us to emulate wiggling without physically doing so. Specifically, applying a smart sequence of conditioning operations enables us to predict the effect of actions or interventions without actually conducting an experiment.
This ability to emulate interventions by smart observations could not have been acquired had the statistical properties of Bayesian networks not been unveiled between 1980 and 1988. We can now decide which set of variables we must measure in order to predict the effects of interventions from observational studies.
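A sketch of one such "smart sequence of conditioning operations" on simulated data with a made-up confounder Z: back-door adjustment on Z recovers the interventional effect that naive conditioning on X gets wrong. The model and numbers are assumptions for illustration, not an example from the book.

```python
# Emulating an intervention from observational data via back-door adjustment.
# Z confounds X and Y; the true effect of X on Y is +0.2.  Numbers are made up.
import numpy as np
rng = np.random.default_rng(0)

n = 500_000
z = rng.random(n) < 0.5                        # confounder
x = rng.random(n) < np.where(z, 0.8, 0.2)      # Z pushes X up
y = rng.random(n) < (0.1 + 0.2 * x + 0.5 * z)  # Z also pushes Y up

# Naive "seeing": P(Y | X=1) - P(Y | X=0) is inflated by the confounder.
naive = y[x].mean() - y[~x].mean()

# Back-door adjustment: P(Y | do(X=x)) = sum_z P(Y | X=x, Z=z) * P(Z=z).
def p_do(x_val):
    return sum(y[(x == x_val) & (z == z_val)].mean() * (z == z_val).mean()
               for z_val in (False, True))

adjusted = p_do(True) - p_do(False)
print(f"naive difference:     {naive:.3f}")     # roughly 0.5 -- too big
print(f"adjusted (do) effect: {adjusted:.3f}")  # close to the true +0.2
```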
Before we turn to the new science of cause and effect—illuminated by causal models—we should first try to understand the strengths and limitations of the old, model-blind science: why randomization is needed to conclude that A causes B and the nature of the threat (called “confounding”) that RCTs are intended to disarm.
After we examine these issues in the light of causal diagrams, we can place randomized controlled trials into their proper context. Either we can view them as a special case of our inference engine, or we can view causal inference as a vast extension of RCTs. Either viewpoint is fine, and perhaps people trained to see RCTs as the arbiter of causation will find the latter more congenial.
randomization actually brings two benefits. First, it eliminates confounder bias (it asks Nature the right question). Second, it enables the researcher to quantify his uncertainty.
Their recognition that nonstatistical criteria were necessary was a great step forward.
Viewed from the perspective of causality, the report was at best a modest success. It clearly established the gravity of causal questions and that data alone could not answer them.
I imagine this is how Hipparchus of Nicaea felt when he discovered he could figure out the height of a pyramid from its shadow on the ground, without actually climbing the pyramid. It was a clear victory of mind over matter.
Now let us return to our central question of when a model can replace an experiment, or when a “do” quantity can be reduced to a “see” quantity.
Note that each rule has a simple syntactic interpretation. Rule 1 permits the addition or deletion of observations. Rule 2 permits the replacement of an intervention with an observation, or vice versa. Rule 3 permits the deletion or addition of interventions. All of these permits are issued under appropriate conditions, which have to be verified in any particular case from the causal diagram.
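For reference, the three rules in their usual notation, reproduced from the standard literature as a reminder rather than as a quotation from the book. Here G with a bar over X denotes the diagram with all arrows into X removed, G with an underbar on Z the diagram with all arrows out of Z removed, and Z(W) the set of Z-nodes that are not ancestors of any W-node in the X-pruned graph.

```latex
% Rule 1 (insertion/deletion of observations)
P(y \mid do(x), z, w) = P(y \mid do(x), w)
  \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ holds in } G_{\overline{X}}

% Rule 2 (exchange of action and observation)
P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)
  \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ holds in } G_{\overline{X}\,\underline{Z}}

% Rule 3 (insertion/deletion of actions)
P(y \mid do(x), do(z), w) = P(y \mid do(x), w)
  \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ holds in } G_{\overline{X}\,\overline{Z(W)}}
```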
Both teams were, however, recognized with best student paper awards at the Uncertainty in Artificial Intelligence conference in 2006.
Before declaring total victory, we should discuss one issue with the do-calculus. Like any other calculus, it enables the construction of a proof, but it does not help us find one. It is an excellent verifier of a solution but not such a good searcher for one. If you know the correct sequence of transformations, it is easy to demonstrate to others (who are familiar with Rules 1 to 3) that the do-operator can be eliminated. However, if you do not know the correct sequence, it is not easy to discover it, or even to determine whether one exists.
The number of possibilities is limitless, and the axioms themselves provide no guidance about what to try next. My high school geometry teacher used to say that you need “mathematical eyeglasses.” In mathematical logic, this is known as the “decision problem.” Many logical systems are plagued with intractable decision problems.
Shpitser’s algorithm for finding each and every causal effect does not eliminate the need for the do-calculus. In fact, we need it even more, and for several independent reasons. First, we need it in order to go beyond observational studies. Suppose that worst comes to worst, and our causal model does not permit estimation of the causal effect P(Y | do(X)) from observations alone. Perhaps we also cannot conduct a randomized experiment with random assignment of X. A clever researcher might ask whether we might estimate P(Y | do(X)) by randomizing some other variable, say Z, that is more
Even more problems of this sort arise when we consider problems of transportability or external validity—assessing whether an experimental result will still be valid when transported to a different environment that may differ in several key ways from the one studied. This more ambitious set of questions touches on the heart of scientific methodology, for there is no science without generalization.
In 2015, Bareinboim and I presented a paper at the National Academy of Sciences that solves the problem, provided that you can express your assumptions about both environments with a causal diagram.
Yet another reason that the do-calculus remains important is transparency. As I wrote this chapter, Bareinboim (now a professor at Purdue) sent me a new puzzle: a diagram with just four observed variables, X, Y, Z, and W, and two unobservable variables, U1, U2 (see Figure 7.5). He challenged me to figure out if the effect of X on Y was estimable. There was no way to block the back-door paths and no front-door condition. I tried all my favorite shortcuts and my otherwise trustworthy intuitive arguments, both pro and con, and I couldn’t see how to do it. I could not find a way out of the maze.
Around 2005, Wermuth and Cox became interested in a problem called “sequential decisions” or “time-varying treatments,” which are common, for example, in the treatment of AIDS. Typically treatments are administered over a length of time, and in each time period physicians vary the strength and dosage of a follow-up treatment according to the patient’s condition. The patient’s condition, on the other hand, is influenced by the treatments taken in the past.
It so happens that a single application of the three rules of do-calculus can accomplish this. The moral of the story is nothing but a deep appreciation of the power of mathematics to solve difficult problems, which occasionally entail practical consequences.
He was too talented for any of his high school math teachers to keep him interested. What he eventually accomplished was truly amazing. Verma finally proved what became known as the d-separation property (i.e., the fact that you can use the rules of path blocking to determine which independencies should hold in the data). Astonishingly, he told me that he proved the d-separation property thinking it was a homework problem, not an unsolved conjecture! Sometimes it pays to be young and naive. You can still see his legacy in Rule 1 of the do-calculus and in any imprint that path blocking leaves
Four years later, in April 2001, he stunned the world with a simple graphical criterion that generalizes the front door, the back door, and all doors we could think of at the time. I recall presenting Tian’s criterion at a Santa Fe conference. One by one, leaders in the research community stared at my poster and shook their heads in disbelief. How could such a simple criterion work for all diagrams? Tian (now a professor at Iowa State University) came to our lab with a style of thinking that was foreign to us then, in the 1990s. Our conversations were always loaded with wild metaphors and
Tian’s method, called c-decomposition, enabled Ilya Shpitser to develop his complete algorithm for the do-calculus. The moral: never underestimate the power of a locker-room conversation! Ilya Shpitser came in at the end of the ten-year battle to understand interventions.
Completeness proofs are notoriously difficult and are best avoided by any student who aims to finish his PhD on time.
At a lecture of his in Uppsala, Sweden, I first learned that performing interventions could be thought of as deleting arrows from a causal diagram.
Until then I had been laboring under the same burden as generations of statisticians, trying to think of causality in terms of only one diagram representing one static probability distribution.
Trygve Haavelmo (a Norwegian economist and Nobel laureate), who in 1943 advocated equation modification to represent interventions. Nevertheless, Spirtes’s translation of equation deletion into the world of causal diagrams unleashed an avalanche of new insights and new results. The back-door criterion was one of the first beneficiaries of the translation, while the do-calculus came second. The avalanche, however, is not yet over.
A variable that satisfies these three properties is today called an instrumental variable. Clearly Snow thought of this variable as similar to a coin flip, which simulates a variable with no incoming arrows.
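A sketch of how such an instrument can be used, under an assumed linear model with made-up coefficients: the ratio Cov(Z, Y) / Cov(Z, X) (the classical Wald/IV estimate) recovers the causal effect that a naive regression of Y on X misses because of the unobserved confounder.

```python
# Instrumental-variable estimation under a made-up linear model.
# Z is the instrument (like Snow's coin flip: no incoming arrows),
# U an unobserved confounder of X and Y.  The true effect of X on Y is 2.0.
import numpy as np
rng = np.random.default_rng(0)

n = 200_000
z = rng.normal(size=n)                      # instrument
u = rng.normal(size=n)                      # unobserved confounder
x = 1.0 * z + 1.5 * u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)

naive = np.cov(x, y)[0, 1] / np.var(x)              # regression slope: biased by U
iv    = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]     # Wald / IV estimate

print(f"naive slope: {naive:.2f}   IV estimate: {iv:.2f}   true effect: 2.00")
```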