Kindle Notes & Highlights
by Gary Smith
Started reading: July 16, 2020
A wonderful example is the Monty Hall problem: On the television show Let’s Make a Deal, you are offered a choice among three doors: behind one is a grand prize, and behind the other two are goats. After you pick a door, the host, Monty Hall, does what he always does: he shows you a goat behind one of the doors you did not choose and asks if you want to switch doors.
Does it matter whether Monty Hall merely reminds you that there is a goat behind one of the other doors, or proves it by showing you one? You haven’t learned anything useful about the door you chose: there is still a one-third chance that it is the winning door, so the probability that the remaining unopened door is the winner has risen to two-thirds. You should switch.
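The two-thirds figure is easy to check by simulation. This sketch is not from the book; it simply plays the game many times under the stated rules (Monty always opens a goat door you did not pick):

```python
import random

def monty_hall(switch, trials=100_000, seed=0):
    """Estimate the win probability of staying vs. switching."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)   # door hiding the grand prize
        pick = rng.randrange(3)    # contestant's initial choice
        # Monty opens a door that is neither the pick nor the prize
        goat = next(d for d in range(3) if d != pick and d != prize)
        if switch:
            # switch to the one remaining unopened door
            pick = next(d for d in range(3) if d != pick and d != goat)
        wins += (pick == prize)
    return wins / trials

print(monty_hall(switch=False))  # close to one-third
print(monty_hall(switch=True))   # close to two-thirds
```

Staying wins only when the initial pick was right (probability one-third); switching wins in every other case.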
Don’t be Fooled: Data clusters are everywhere, even in random data. Someone who looks for an explanation will inevitably find one, but a theory that fits a data cluster is not persuasive evidence. Any explanation found this way needs to make sense, and it needs to be tested with uncontaminated data.
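The point that clusters appear in purely random data can be demonstrated directly. This illustrative sketch (not from the book) flips a fair coin 100 times and finds the longest streak of identical outcomes; long streaks turn out to be routine:

```python
import random

def longest_run(seed, n=100):
    """Longest run of identical outcomes in n fair coin flips."""
    rng = random.Random(seed)
    flips = [rng.randrange(2) for _ in range(n)]
    best = run = 1
    for a, b in zip(flips, flips[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best

# Across many simulated sequences, streaks of five or more identical
# flips are the norm -- clusters appear even though nothing causes them.
runs = [longest_run(seed) for seed in range(1000)]
print(sum(r >= 5 for r in runs) / len(runs))
```

A streak of five heads looks like a pattern begging for a theory, yet it is exactly what chance alone produces.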
In 1996 the Gardner brothers wrote a wildly popular book with the beguiling name, The Motley Fool Investment Guide: How the Fools Beat Wall Street’s Wise Men and How You Can Too. Hey, if fools can beat the market, so can we all. The Gardners recommended what they called the Foolish Four Strategy. They claimed that during the years 1973–93, the Foolish Four Strategy had an annual average return of 25 percent.
But beyond this kernel of a borrowed idea, the Foolish Four Strategy is pure data mining.
Shortly after the Gardners launched the Foolish Four Strategy, two skeptical finance professors tested it using data from the years 1949–72, just prior to the period data mined by the Gardners. It didn’t work. The professors also retested the Foolish Four Strategy during the years that were data mined by the Gardners, but with a clever twist. Instead of choosing the portfolio on the first trading day in January, they implemented the strategy on the first trading day of July. If the strategy has any merit, it shouldn’t be sensitive to the starting month. But, of course, it was.
Another common problem with theorizing about what we observe is the survivor bias that can occur because we don’t see things that no longer exist. A study of the elderly does not include people who did not live long enough to become elderly. An examination of planes that survived bombing runs does not include planes that were shot down.
Be doubly skeptical of graphs that have two vertical axes and omit zero from either or both axes.
A very common logical error is to confuse two conditional statements. The probability that a person who has a disease will have a positive test result is not the same as the probability that a person with a positive test result has the disease.
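Bayes' rule makes the difference between the two conditional probabilities concrete. The numbers below are assumed for illustration (1 percent prevalence, a test that catches 90 percent of cases and false-alarms on 10 percent of healthy people); they are not from the book:

```python
def p_disease_given_positive(prevalence, sensitivity, false_positive_rate):
    """Bayes' rule: P(disease | positive test)."""
    true_pos = prevalence * sensitivity               # sick and test positive
    false_pos = (1 - prevalence) * false_positive_rate  # healthy but test positive
    return true_pos / (true_pos + false_pos)

# Assumed illustrative numbers: even though the test is positive for
# 90% of the sick, a positive result means disease only ~8% of the time,
# because healthy people vastly outnumber the sick.
print(round(p_disease_given_positive(0.01, 0.90, 0.10), 3))  # 0.083
```

P(positive | disease) is 0.90, but P(disease | positive) is about 0.083: confusing the two overstates the evidence more than tenfold.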
As the population grows over time, so do many human activities (including the number of people watching television, eating oranges, and dying), which are unrelated but nonetheless statistically correlated because they all grow with the population. Watching television does not make us eat oranges, and eating oranges does not kill us.
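Any two series that merely trend upward together will show a high correlation. This sketch uses made-up numbers for two of the book's examples (television viewers and orange consumption) to show how a shared trend manufactures correlation:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Invented series that simply grow with population over 20 years:
# no causal link between them, yet the correlation is nearly perfect.
years = range(20)
tv_viewers = [100 + 5 * t for t in years]
orange_eaters = [50 + 3 * t + (t % 4) for t in years]  # growth plus wobble
print(round(pearson_r(tv_viewers, orange_eaters), 3))
```

The correlation is close to 1.0 despite the series having nothing to do with each other; the shared population trend does all the work.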
When you hear a puzzling assertion (or even one that makes sense), think about whether confounding factors might be responsible. Sweden has a higher female mortality rate than Costa Rica because there are more elderly women in Sweden. Berkeley’s graduate programs admitted a smaller fraction of female applicants because women applied to the more selective programs.
Expect those at the extremes to regress to the mean.
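Regression to the mean follows whenever observed performance mixes stable skill with luck. This sketch (not from the book; the skill-plus-noise model is an assumption for illustration) tracks the top decile from one round into the next:

```python
import random

def regression_demo(n=10_000, seed=0):
    """Skill is fixed; each round's score is skill plus fresh luck.
    Returns the top decile's average score in round 1 and round 2."""
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n)]
    round1 = [s + rng.gauss(0, 1) for s in skill]
    round2 = [s + rng.gauss(0, 1) for s in skill]
    # pick the round-1 top decile, then see how they do in round 2
    top = sorted(range(n), key=lambda i: round1[i], reverse=True)[: n // 10]
    avg1 = sum(round1[i] for i in top) / len(top)
    avg2 = sum(round2[i] for i in top) / len(top)
    return avg1, avg2

avg1, avg2 = regression_demo()
print(round(avg1, 2), round(avg2, 2))  # round 2 falls back toward the mean
```

The top performers were partly skilled and partly lucky; the skill carries over to round 2 but the luck does not, so their average drops toward the mean without anyone getting worse.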
Researchers seeking fame and funding often turn into Texas sharpshooters, firing at random and painting a bullseye around the area with the most hits. It is easy to find a theory that fits the data if the data are used to invent the theory.
DATA WITHOUT THEORY ARE JUST DATA