A Bayesian statistician, on the other hand, would say, “Wait a minute. We also need to take into account our prior knowledge about the coin.” Did it come from the neighborhood grocery or a shady gambler? If it’s just an ordinary quarter, most of us would not let the coincidence of nine heads sway our belief so dramatically. On the other hand, if we already suspected the coin was weighted, we would conclude more willingly that the nine heads provided serious evidence of bias.
To articulate subjective assumptions, Bayesian statisticians still use the language of probability, the native language of Galton and Pearson. The assumptions entering causal inference, on the other hand, require a richer language (e.g., diagrams) that is foreign to Bayesians and frequentists alike.
Moreover, the subjective component in causal information does not necessarily diminish over time, even as the amount of data increases. Two people who believe in two different causal diagrams can analyze the same data and may never come to the same conclusion, regardless of how “big” the data are. This is a terrifying prospect for advocates of scientific objectivity, which explains their refusal to accept the inevitability of relying on subjective causal information.
How strongly should she believe the hypothesis? Should she have surgery? We can answer these questions by rewriting Bayes's rule as follows:

(Updated probability of D) = P(D | T) = (likelihood ratio) × (prior probability of D)    (3.2)

where the new term "likelihood ratio" is given by P(T | D)/P(T). It measures how much more likely the positive test is in people with the disease than in the general population.
the new evidence T augments the probability of D by a fixed ratio, no matter what the prior probability was.
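To make the arithmetic concrete, here is a minimal Python sketch of equation (3.2); the prior, sensitivity, and false-positive rate are assumed numbers chosen only for illustration.

def posterior(prior_d, p_t_given_d, p_t_given_not_d):
    # Law of total probability: P(T) = P(T | D) P(D) + P(T | not D) P(not D)
    p_t = p_t_given_d * prior_d + p_t_given_not_d * (1 - prior_d)
    likelihood_ratio = p_t_given_d / p_t          # the "likelihood ratio" of equation (3.2)
    return likelihood_ratio * prior_d             # updated probability of D

print(posterior(prior_d=0.01, p_t_given_d=0.90, p_t_given_not_d=0.10))   # about 0.083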
the tiny number of true positives (i.e., women with breast cancer) is overwhelmed by the number of false positives. Our sense of surprise at this result comes from the common cognitive confusion between the forward probability, which is well studied and thoroughly documented, and the inverse probability, which is needed for personal decision making.
the story would be very different if our patient had a gene that put her at high risk for breast cancer—say, a one-in-twenty chance within the next year. Then a positive test would increase the probability to almost one in three. For a woman in this situation, the chances that the test provides lifesaving information are much higher. That is why the task force continued to recommend annual mammograms for high-risk women.
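A rough sketch of this context dependence, with assumed test characteristics rather than the book's actual figures: under these numbers, the same positive test pushes a 1-in-700 prior to only about 1 percent, but a 1-in-20 prior to roughly one in three.

def p_disease_given_test(prior, sensitivity, false_pos_rate):
    p_test = sensitivity * prior + false_pos_rate * (1 - prior)
    return sensitivity * prior / p_test

# P(test | disease) is the same in both cases; only the prior changes.
for prior in (1 / 700, 1 / 20):                   # average-risk vs. high-risk (assumed priors)
    print(round(p_disease_given_test(prior, sensitivity=0.90, false_pos_rate=0.10), 3))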
P(disease | test) is not the same for everyone; it is context dependent. If you know that you are at high risk for a disease to begin with, Bayes’s rule allows you to factor that information in. Or if you know that you are immune, you need not even bother with the test! In contrast, P(test | disease) does not depend on whether you are at high risk or not. It is “robust” to such variations, which explains to some degree why physicians organize their knowledge and communicate with forward probabilities. The former are properties of the disease itself, its stage of progression, or the sensitivity
The textbook description of the scientific method goes something like this: (1) formulate a hypothesis, (2) deduce a testable consequence of the hypothesis, (3) perform an experiment and collect evidence, and (4) update your belief in the hypothesis.
Lotfi Zadeh of Berkeley offered “fuzzy logic,” in which statements are neither true nor false but instead take a range of possible truth values.
Glenn Shafer of the University of Kansas proposed “belief functions,” which assign two probabilities to each fact, one indicating how likely it is to be “possible,” the other, how likely it is to be “provable.” Edward Feigenbaum and his colleagues at Stanford University tried “certainty factors,” which inserted numerical measures of uncertainty into their deterministic rules for inference.
these approaches suffered a common flaw: they modeled the expert, not the world, and therefore tended ...
they could not operate in both diagnostic and predictive modes, the uncontested specialty of Bayes’s rule. In the certainty factor approach, the rule “If fire, then smoke (with certainty c1)” could not combine coherently with “If smoke, then fire (w...
David Rumelhart, a cognitive scientist at the University of California, San Diego, and a pioneer of neural networks. His article about children’s reading, published in 1976, made clear that reading is a complex process in which neurons on many different levels are active at the same time (see Figure 3.4). Some of the neurons are simply recognizing individual features—circles or lines. Above them, another layer of neurons is combining these shapes and forming conjectures about what the letter might be.
any artificial intelligence would have to model itself on what we know about human neural information processing and that machine reasoning under uncertainty would have to be constructed with a similar message-passing architecture. But what are the messages? This took me quite a few months to figure out. I finally realized that the messages were conditional probabilities in one direction and likelihood ratios in the other.
The real challenge was to ensure that no matter in what order these messages are sent out, things will settle eventually into a comfortable equilibrium; moreover, the final equilibrium will represent the correct state of belief in the variables. By “correct” I mean the same result we would get if we had conducted the computation by textbook methods rather than by message passing.
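A toy illustration of the idea (my own sketch, not the original implementation): on a three-node chain of binary variables, likelihood messages flow upward from the evidence and prior/conditional-probability messages flow downward, and combining them at each node reproduces the textbook posteriors. All numbers are assumed.

import numpy as np

P_A = np.array([0.3, 0.7])                 # prior on A (assumed)
P_B_given_A = np.array([[0.9, 0.1],        # rows: value of A, cols: value of B
                        [0.2, 0.8]])
P_C_given_B = np.array([[0.95, 0.05],      # rows: value of B, cols: value of C
                        [0.25, 0.75]])

# Evidence: we observe C = 1.
lambda_C = np.array([0.0, 1.0])

# Upward pass: each node sends a likelihood message to its parent.
lambda_B = P_C_given_B @ lambda_C          # P(evidence | B)
lambda_A = P_B_given_A @ lambda_B          # P(evidence | A)

# Beliefs = downward (prior) message combined with upward (likelihood) message.
belief_A = P_A * lambda_A
belief_A /= belief_A.sum()

pi_B = P_A @ P_B_given_A                   # downward message: P(B) before the evidence
belief_B = pi_B * lambda_B
belief_B /= belief_B.sum()

print("P(A | C=1) =", belief_A)
print("P(B | C=1) =", belief_B)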
A → B → C. This junction is the simplest example of a “chain,” or of mediation. In science, one often thinks of B as the mechanism, or “mediator,” that transmits the effect of A to C. A familiar example is Fire → Smoke → Alarm.
If we disable that link in the chain, for instance by sucking all the smoke molecules away with a fume hood, then there will be no alarm. This observation leads to an important conceptual point about chains: the mediator B “screens off” information about A from C, and vice versa.
The process of looking only at rows in the table where Smoke = 1 is called conditioning on a variable. Likewise, we say that Fire and Alarm are conditionally independent, given the value of Smoke.
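A quick simulation, with made-up probabilities, shows the screening-off effect: Fire and Alarm are strongly associated overall, but once we condition on Smoke = 1 the association essentially disappears.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

fire = rng.random(n) < 0.01
smoke = np.where(fire, rng.random(n) < 0.90, rng.random(n) < 0.02)
alarm = np.where(smoke, rng.random(n) < 0.95, rng.random(n) < 0.01)

# Unconditionally, Fire and Alarm are strongly associated.
print(alarm[fire].mean(), alarm[~fire].mean())

# Conditioning on Smoke = 1: the association (nearly) vanishes.
rows = smoke == 1
print(alarm[rows & fire].mean(), alarm[rows & ~fire].mean())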
A ← B → C. This kind of junction is called a “fork,” and B is often called a common cause or confounder of A and C. A confounder will make A and C statistically correlated even though there is no direct causal link between them.
Shoe Size ← Age of Child → Reading Ability. Children with larger shoes tend to read at a higher level. But the relationship is not one of cause and effect. Giving a child larger shoes won’t make him read better! Instead, both variables are explained by a third, which is the child’s age.
We can eliminate this spurious correlation, as Karl Pearson and George Udny Yule called it, by conditioning on the child’s age.
if we look only at seven-year-olds, we expect to see no relationship between shoe size and reading ability.
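A short simulation of the fork, with invented numbers for how shoe size and reading level grow with age, makes the point: the overall correlation is strong, but within a single age group it is near zero.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000

age = rng.integers(6, 13, n)                       # children aged 6 to 12
shoe_size = 2.0 * age + rng.normal(0, 1.5, n)      # driven by age only
reading = 1.0 * age + rng.normal(0, 1.0, n)        # driven by age only

print(np.corrcoef(shoe_size, reading)[0, 1])       # strong spurious correlation
seven = age == 7
print(np.corrcoef(shoe_size[seven], reading[seven])[0, 1])   # roughly zero after conditioning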
A → B ← C. This is the most fascinating junction, called a “collider.” Felix Elwert and Chris Winship have illustrated this junction using three features of Hollywood actors: Talent → Celebrity ← Beauty.
both talent and beauty contribute to an actor’s success, but beauty and talent are completely unrelated to one another in the general population.
If A and C are independent to begin with, conditioning on B will make them dependent. For example, if we look only at famous actors (in other words, we observe the variable Celebrity = 1), we will see a negative correlation between talent and beauty: finding out that a celebrity is unattractive increases our belief that he or she is talented.
given the outcome Celebrity = 1, talent and beauty are inversely related—even though they are not related in the population as a whole.
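The collider effect is easy to reproduce in a simulation; here talent and beauty are drawn independently, and an assumed selection rule (celebrity if talent plus beauty is high enough) stands in for Hollywood.

import numpy as np

rng = np.random.default_rng(2)
n = 200_000

talent = rng.normal(0, 1, n)
beauty = rng.normal(0, 1, n)                       # independent of talent
celebrity = (talent + beauty) > 2.0                # assumed selection rule

print(np.corrcoef(talent, beauty)[0, 1])                           # about zero overall
print(np.corrcoef(talent[celebrity], beauty[celebrity])[0, 1])     # negative among celebrities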
three junctions—chains, forks, and colliders—are
specify the conditional probability of each node given its “parents.” (Remember that the parents of a node are all the nodes that feed into it.)
By depicting A as a root node, we do not really mean that A has no prior causes. Hardly any variable is entitled to such a status. We really mean that any prior causes of A can be adequately summarized in the prior probability P(A) that A is true. For example, in the Disease → Test example, family history might be a cause of Disease.
as long as we are sure that this family history will not affect the variable Test (once we know the status of Disease), we need not represent it as a node in the graph.
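As a sketch of what such a specification looks like (hand-rolled, not any particular library's format), the two-node Disease → Test network needs only a prior for the root and one conditional probability table for the child; the joint distribution is then just their product. The numbers are illustrative.

# Prior on the root node Disease.
P_disease = {True: 0.01, False: 0.99}

# Conditional probability table: P(Test | Disease).
P_test_given_disease = {
    True:  {True: 0.90, False: 0.10},
    False: {True: 0.10, False: 0.90},
}

# The joint distribution is recovered by multiplying each node's entry
# given its parents: P(Disease, Test) = P(Disease) * P(Test | Disease).
joint = {
    (d, t): P_disease[d] * P_test_given_disease[d][t]
    for d in (True, False) for t in (True, False)
}
print(joint)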
However, if there is a cause of Disease that also directly affects Test, then that cause must be represen...
Stefan Conrady and Lionel Jouffe of BayesiaLab,
After one minute, there is still a 47 percent chance that it was on the plane. (Remember that our prior assumption was a 50 percent probability.) After five minutes, the probability drops to 33 percent. After ten minutes, of course, it drops to zero.
Even with this tiny network of three nodes, there were 2 × 11 = 22 parent states, each contributing to the probability of the child state. For a computer, though, such computations are elementary … up to a point. If they aren’t done in an organized fashion, the sheer number of computations can overwhelm even the fastest supercomputer.
another engineer, Robert Gallager of the Massachusetts Institute of Technology, had discovered a code that used belief propagation (though not called by that name) way back in 1960, so long ago that MacKay describes his code as “almost clairvoyant.” In any event, it was too far ahead of its time. Gallager needed thousands of processors on a chip, passing messages back and forth about their degree of belief that a particular information bit was a one or a zero. In 1960 this was impossible, and his code was virtually forgotten until MacKay rediscovered it in 1998. Today, it is in every cell
A Bayesian network is literally nothing more than a compact representation of a huge probability table. The arrows mean only that the probabilities of child nodes are related to the values of parent nodes by a certain formula (the conditional probability tables) and that this relation is sufficient. That is, knowing additional ancestors of the child will not change the formula. Likewise, a missing arrow between any two nodes means that they are independent, once we know the values of their parents. We saw a simple version of this statement earlier, when we discussed the screening-off effect in
a causal diagram, then both the thinking that goes into the construction and the interpretation of the final diagram change. In the construction phase, we need to examine each variable, say C, and ask ourselves which other variables it “listens” to before choosing its value. The chain structure A → B → C means that B listens to A only, C listens to B only, and A listens to no one; that is, it is determined by external forces that are not part of our model.
if we reverse the order of arrows in the chain, thus obtaining A ← B ← C, the causal reading of the structure will change drastically, but the independence conditions will remain the same. The missing arrow between A and C will still mean that A and C are independent once we know the value of B, as in the original chain.
if the observed data do not show A and C to be independent, conditional on B, then we can safely conclude that the chain model is incompatible with the data and needs to be discarded (or repaired). Second, the graphical properties of the diagram dictate which causal models can be distinguished by data and which will forever remain indistinguishable, no matter how large the data. For example, we cannot distinguish the fork A ← B → C from the chain A → B → C by data alone because, with C listening to B only, the two imply the same independence conditions.
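The indistinguishability of the chain and the fork can be checked directly: with assumed tables, the chain factorization P(a)P(b|a)P(c|b) and the fork factorization P(b)P(a|b)P(c|b) yield exactly the same joint distribution.

import numpy as np

P_A = np.array([0.4, 0.6])
P_B_given_A = np.array([[0.7, 0.3], [0.2, 0.8]])
P_C_given_B = np.array([[0.9, 0.1], [0.5, 0.5]])

# Joint from the chain A -> B -> C: P(a) P(b|a) P(c|b).
joint_chain = P_A[:, None, None] * P_B_given_A[:, :, None] * P_C_given_B[None, :, :]

# Re-express the same joint as a fork A <- B -> C: P(b) P(a|b) P(c|b).
P_B = (P_A[:, None] * P_B_given_A).sum(axis=0)
P_A_given_B = (P_A[:, None] * P_B_given_A) / P_B          # Bayes inversion
joint_fork = P_B[None, :, None] * P_A_given_B[:, :, None] * P_C_given_B[None, :, :]

print(np.allclose(joint_chain, joint_fork))                # True: same data, two causal stories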
An arrow from A to C means that if we could wiggle only A, then we would expect to see a change in the probability of C. A missing arrow from A to C means that in the same experiment we would not see any change in C, once we held constant the parents of C (in other words, B in the example above).
Note that the probabilistic expression “once we know the value of B” has given way to the causal expression “once we hold B constant,” which implies that we are physically preventing B from varying and disabling the arrow from A to B.
the causal fork A ← B → C tells us in no uncertain terms that wiggling A would have no effect on C, no matter how intense the wiggle. On the other hand, a Bayesian network is not equipped to handle a “wiggle,” or to tell the difference between seeing and doing, or indeed to distinguish a fork from a chain. In other words, both a chain and a fork would predict that observed changes in A are associated with changes in C, making no prediction about the effect of “wiggling” A.
The relationships that were discovered between the graphical structure of the diagram and the data that it represents now permit us to emulate wiggling without physically doing so. Specifically, applying a smart sequence of conditioning operations enables us to predict the effect of actions or interventions without actually conducting an experiment.
consider again the causal fork A ← B → C, in which we proclaimed the correlation between A and C to be spurious. We can verify this by an experiment in which we wiggle A and find no correlation between A and C. But we can do better. We can ask the diagram to emulate the experiment and tell us if any conditioning operation can reproduce the correlation that would prevail in the experiment.
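Here is a sketch of that emulation for the fork, using assumed tables: conditioning naively on A produces a spurious dependence of C on A, while the adjustment formula, a weighted average of P(C | A, B) over the distribution of B, shows that wiggling A leaves C untouched.

import numpy as np

P_B = np.array([0.3, 0.7])
P_A_given_B = np.array([[0.9, 0.1], [0.2, 0.8]])    # rows: b, cols: a
P_C_given_B = np.array([[0.8, 0.2], [0.3, 0.7]])    # rows: b, cols: c

# Joint P(b, a, c) = P(b) P(a|b) P(c|b); in a pure fork, C ignores A once B is known.
joint = P_B[:, None, None] * P_A_given_B[:, :, None] * P_C_given_B[:, None, :]

# Observational ("seeing") distribution P(c | a): the rows differ, a spurious dependence.
P_ac = joint.sum(axis=0)                            # P(a, c)
print(P_ac / P_ac.sum(axis=1, keepdims=True))

# Interventional ("doing") distribution via adjustment: sum over b of P(c | a, b) P(b).
P_c_given_ab = joint / joint.sum(axis=2, keepdims=True)       # P(c | b, a)
P_c_do_a = np.einsum('b,bac->ac', P_B, P_c_given_ab)
print(P_c_do_a)                                     # identical rows: wiggling A leaves C unchanged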
someone may ask why wiggling A makes C vary. Is it really the direct effect of A, or is it the effect of a mediating variable B? If both, can we assess what portion of the effect is mediated by B? To answer such mediation questions, we have to envision two simultaneous interventions: wiggling A and holding B constant (to be distinguished from conditioning on B). If we can perform this intervention physically, we obtain the answer to our question. But if we are at the mercy of observational studies, we need to emulate the two actions with a clever set of observations. Again, the graphical
the variable Z at the center of the fork is a confounder of X and Y. (We will see a more universal definition later, but this triangle is the most recognizable and common situation.)
For example, if we are testing a drug and give it to patients who are younger on average than the people in the control group, then age becomes a confounder—a lurking third variable. If we don’t have any data on the ages, we will not be able to disentangle the true effect from the spurious effect.
if the confounding variable Z is age, we compare the treatment and control groups in every age group separately. We can then take an average of the effects, weighting each age group according to its percentage in the target population. This method of compensation is familiar to all statisticians; it is called “adjusting for Z” or “controlling for Z.”
if you have identified a sufficient set of deconfounders in your diagram, gathered data on them, and properly adjusted for them, then you have every right to say that you have computed the causal effect X → Y (provided, of course, that you can defend your causal diagram on scientific grounds).
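A minimal sketch of such an adjustment, on a tiny hypothetical data set with age as the only confounder: estimate the treatment effect within each age group, then average the estimates weighted by each group's share of the population.

import pandas as pd

# Hypothetical observational data with columns 'age_group', 'treated', 'recovered'.
df = pd.DataFrame({
    'age_group': ['young'] * 4 + ['old'] * 4,
    'treated':   [1, 1, 0, 0, 1, 1, 0, 0],
    'recovered': [1, 1, 1, 0, 1, 0, 0, 0],
})

weights = df['age_group'].value_counts(normalize=True)          # share of each age group

effect = 0.0
for group, share in weights.items():
    stratum = df[df['age_group'] == group]
    diff = (stratum.loc[stratum.treated == 1, 'recovered'].mean()
            - stratum.loc[stratum.treated == 0, 'recovered'].mean())
    effect += share * diff                                       # weight by group size

print(effect)      # adjusted estimate of the effect of treatment on recovery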