Reasoning forward from hypothetical pasts lays the foundation for us to then work backward to the most probable one.
Using calculus, the once-controversial mathematics of which Bayes had been an important defender, Laplace was able to prove that this vast spectrum of possibilities could be distilled down to a single estimate, and a stunningly concise one at that.
In fact, for any possible drawing of w winning tickets in n attempts, the expectation is simply the number of wins plus one, divided by the number of attempts plus two: (w+1)⁄(n+2).
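A minimal sketch of Laplace's estimate in code (my own illustration, not the book's):

```python
def laplace_estimate(wins: int, attempts: int) -> float:
    """Laplace's Law: the expected chance that the next ticket wins,
    given w winning tickets observed in n attempts."""
    return (wins + 1) / (attempts + 2)

# Three winners in three draws gives (3+1)/(3+2) = 0.8,
# not the naive and overconfident 3/3 = 100%.
print(laplace_estimate(3, 3))   # 0.8
# A single losing draw still leaves real hope: (0+1)/(1+2) = 1/3.
print(laplace_estimate(0, 1))   # 0.333...
```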
The mathematical formula that describes this relationship, tying together our previously held ideas and the evidence before our eyes, has come to be known—ironically, as the real heavy lifting was done by Laplace—as Bayes’s Rule. And it gives a remarkably straightforward solution to the problem of how to combine preexisting beliefs with observed evidence: multiply their probabilities together.
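A hedged sketch of that multiplication step, including the one piece of bookkeeping the summary leaves implicit (rescaling so the updated beliefs sum to one); the two-coin scenario is illustrative:

```python
def bayes_update(priors, likelihoods):
    """Multiply each hypothesis's prior probability by the probability
    it assigns to the observed evidence, then rescale so the updated
    beliefs sum to one."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# Two hypotheses about a coin: it's fair, or it's two-headed.
# Prior belief: two-headed coins are a 1-in-100 rarity.
# Evidence: the coin came up heads three times in a row.
print(bayes_update([0.99, 0.01], [0.5**3, 1.0**3]))
# -> roughly [0.925, 0.075]: the fair coin is still the better bet
```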
This sense of what was “in the bag” before the coin flip—the chances for each hypothesis to have been true before you saw any data—is known as the prior probabilities, or “priors” for short.
The fact that Bayes’s Rule is dependent on the use of priors has at certain points in history been considered controversial, biased, even unscientific. But in reality, it is quite rare to go into a situation so totally unfamiliar that our mind is effectively a blank slate—a point we’ll return to momentarily.
It’s difficult to make predictions, especially about the future. —DANISH PROVERB
Copernicus would make the radical paradigm shift of imagining that the Earth was not the bull’s-eye center of the universe—that it was, in fact, nowhere special in particular. Gott decided to take the same step with regard to time.
More generally, unless we know better we can expect to have shown up precisely halfway into the duration of any given phenomenon.* And if we assume that we’re arriving precisely halfway into something’s duration, the best guess we can make for how long it will last into the future becomes obvious: exactly as long as it’s lasted already.
The Copernican Principle predicts that the United States of America will last as a nation until approximately the year 2255, that Google will last until roughly 2032, and that the relationship your friend began a month ago will probably last about another month (maybe tell him not to RSVP to that wedding invitation just yet).
The smartphone as we know it is barely a decade old, and the Copernican Principle tells us that it isn’t likely to be around in 2025, let alone five centuries later.
Simply displaying how long it’s been since the previous bus arrived at that stop offers a substantial hint about when the next one will.
And it turns out that the Copernican Principle is exactly what results from applying Bayes’s Rule using what is known as an uninformative prior.
Anything longer than eight years is within the realm of possibility—but if the wall were going to be around for a million years, it would be a big coincidence that we happened to bump into it so very close to the start of its existence. Therefore, even though enormously long life spans cannot be ruled out, neither are they very likely.
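The wall argument can be checked by simulation. A rough sketch (my own, with an arbitrary 1-to-1,000,000 range standing in for the scale-free prior): sample life spans so that every order of magnitude is equally likely, drop in at a uniformly random moment of each, and keep only the cases where we happen to arrive when the thing is about eight years old. The median total life span comes out near sixteen, i.e., double the observed age:

```python
import random

random.seed(0)
ratios = []
while len(ratios) < 5_000:
    total = 10 ** random.uniform(0, 6)   # life span: log-uniform, 1 to 1e6
    age = random.uniform(0, total)       # the moment we happen to show up
    if 7.5 < age < 8.5:                  # "the wall is about eight years old"
        ratios.append(total / age)
ratios.sort()
print(ratios[len(ratios) // 2])          # median ratio: close to 2
```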
In the mid-twentieth century, the Bayesian statistician Harold Jeffreys had looked into determining the number of tramcars in a city given the serial number on just one tramcar, and came up with the same answer: double the serial number.
Purely mathematical estimates based on captured tanks’ serial numbers predicted that the Germans were producing 246 tanks every month, while estimates obtained by extensive (and highly risky) aerial reconnaissance suggested the figure was more like 1,400. After the war, German records revealed the true figure: 245.
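The single-serial-number version of the estimate is easy to sanity-check in a few lines (my own illustration; the 245 is the true figure from the text). If N tanks are numbered 1 through N, a randomly captured serial averages (N + 1) / 2, so doubling it is right on average:

```python
import random

random.seed(1)
N = 245                                    # true (unknown) monthly output
samples = [2 * random.randint(1, N) for _ in range(100_000)]
print(sum(samples) / len(samples))         # ~246, i.e. N + 1
```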
The richer the prior information we bring to Bayes’s Rule, the more useful the predictions we can get out of it.
They roughly follow what’s termed a “normal” distribution—also known as the “Gaussian” distribution, after the German mathematician Carl Friedrich Gauss, and informally called the “bell curve” for its characteristic shape.
The power-law distribution characterizes a host of phenomena in everyday life that have the same basic quality as town populations: most things below the mean, and a few enormous ones above it.
In fact, money in general is a domain full of power laws. Power-law distributions characterize both people’s wealth and people’s incomes.
It’s often lamented that “the rich get richer,” and indeed the process of “preferential attachment” is one of the surest ways to produce a power-law distribution. The most popular websites are the most likely to get incoming links; the most followed online celebrities are the ones most likely to gain new fans; the most prestigious firms are the ones most likely to attract new clients; the biggest cities are the ones most likely to draw new residents.
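Preferential attachment is simple enough to simulate (a sketch with illustrative parameters, not the book's). Each new link goes to a page chosen with probability proportional to the links it already has, which can be implemented by drawing uniformly from an urn holding one token per existing link. A handful of runaway hubs emerge while the typical page stays tiny:

```python
import random
from collections import Counter

random.seed(2)
urn = [0]                           # page 0 starts with a single link
next_page = 1
for _ in range(100_000):
    if random.random() < 0.10:      # occasionally a brand-new page appears
        urn.append(next_page)
        next_page += 1
    else:                           # otherwise, popularity begets links
        urn.append(random.choice(urn))

counts = Counter(urn)
print(counts.most_common(3))        # a few enormous hubs...
sizes = sorted(counts.values())
print(sizes[len(sizes) // 2])       # ...while the median page stays tiny
```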
Bayes’s Rule tells us that when it comes to making predictions based on limited evidence, few things are as important as having good priors—that is, a sense of the distribution from which we expect that evidence to have come.
Examining the Copernican Principle, we saw that when Bayes’s Rule is given an uninformative prior, it always predicts that the total life span of an object will be exactly double its current age. In fact, the uninformative prior, with its wildly varying possible scales—the wall that might last for months or for millennia—is a power-law distribution. And for any power-law distribution, Bayes’s Rule indicates that the appropriate prediction strategy is a Multiplicative Rule: multiply the quantity observed so far by some constant factor. For an uninformative prior, that constant factor happens to be exactly 2.
This multiplicative rule is a direct consequence of the fact that power-law distributions do not specify a natural scale for the phenomenon they’re describing.
When we apply Bayes’s Rule with a normal distribution as a prior, on the other hand, we obtain a very different kind of guidance. Instead of a multiplicative rule, we get an Average Rule: use the distribution’s “natural” average—its single, specific scale—as your guide.
There are a number of domains in the natural world, too, where events are completely independent from one another and the intervals between them thus fall on an Erlang curve.
The Erlang distribution gives us a third kind of prediction rule, the Additive Rule: always predict that things will go on just a constant amount longer.
In a power-law distribution, the longer something has gone on, the longer we expect it to continue going on.
In a normal distribution, events are surprising when they’re early—since we expected them to reach the average—but not when they’re late.
And in an Erlang distribution, events by definition are never any more or less surprising no matter when they occur.
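All three rules can be read off one numeric computation (my own sketch; the particular normal and Erlang parameters are arbitrary). Given that something is t units old, and assuming we were equally likely to encounter it at any moment of its life, Bayes's Rule makes the posterior over total life spans proportional to prior(T) / T for T >= t; the prediction is that posterior's median:

```python
import numpy as np

T = np.linspace(0.01, 2000, 400_000)       # grid of possible life spans

def predict(prior, t):
    """Posterior median of the total life span, given current age t."""
    post = np.where(T >= t, prior / T, 0.0)
    cdf = np.cumsum(post)
    return T[np.searchsorted(cdf / cdf[-1], 0.5)]

power_law = 1 / T                                 # scale-free prior
normal = np.exp(-((T - 50) ** 2) / (2 * 10**2))   # mean 50, spread 10
erlang = T * np.exp(-T / 20)                      # shape-2 Erlang

for t in (10, 20, 40):
    print(t,
          round(predict(power_law, t)),    # ~2t:     Multiplicative Rule
          round(predict(normal, t)),       # ~50:     Average Rule
          round(predict(erlang, t)))       # ~t + 14: Additive Rule
```

(The finite grid clips the power-law's long tail slightly, so its predictions come out a shade under 2t at larger ages.)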
If your wait for, say, a win at the roulette wheel were characterized by a normal distribution, then the Average Rule would apply: after a run of bad luck, it’d tell you that your number should be coming any second, probably followed by more losing spins.
If, instead, the wait for a win obeyed a power-law distribution, then the Multiplicative Rule would tell you that winning spins follow quickly after one another, but the longer a drought had gone on the longer it would probably continue.
Up against a memoryless distribution, however, you’re stuck. The Additive Rule tells you the chance of a win now is the same as it was an hour ago, and the same as it will be an hour from now.
If it were a normal distribution, then the Average Rule would give a pretty clear forecast of how long he could expect to live: about eight months. But if it were a power-law, with a tail that stretches far out to the right, then the situation would be quite different: the Multiplicative Rule would tell him that the longer he lived, the more evidence it would provide that he would live longer. Reading further, Gould discovered that “the distribution was, indeed, strongly right skewed, with a long tail (however small) that extended for several years above the eight month median. I saw no reason why I shouldn’t be in that small tail, and I breathed a very long sigh of relief.”
Small data is big data in disguise. The reason we can often make good predictions from a small number of observations—or just a single one—is that our priors are so rich.
But if people’s predictions are informed by their experiences, we can use Bayes’s Rule to conduct indirect reconnaissance about the world by mining people’s expectations.
As it happens, pharaohs’ reigns follow an Erlang distribution.
Our judgments betray our expectations, and our expectations betray our experience. What we project about the future reveals a lot—about the world we live in, and about our own past.
In other words, the ability to resist temptation may be, at least in part, a matter of expectations rather than willpower.
If the marshmallow test is about willpower, this is a powerful testament to the impact that learning self-control can have on one’s life. But if the test is less about will than about expectations, then this tells a different, perhaps more poignant story.
Learning self-control is important, but it’s equally important to grow up in an environment where adults are consistently present and trustworthy.
He is careful of what he reads, for that is what he will write. He is careful of what he learns, for that is what he will know. —ANNIE DILLARD
Even when we accumulate biases that aren’t objectively correct, they still usually do a reasonable job of reflecting the specific part of the world we live in.
More or less by definition, events are always experienced at their proper frequencies, but this isn’t at all true of language.
Simply put, the representation of events in the media does not track their frequency in the world.
If you want to be a good intuitive Bayesian—if you want to naturally make good predictions, without having to think about what kind of prediction rule is appropriate—you need to protect your priors.
When we think about thinking, it’s easy to assume that more is better: that you will make a better decision the more pros and cons you list, make a better prediction about the price of a stock the more relevant factors you identify, and write a better report the more time you spend working on it.
The question of how hard to think, and how many factors to consider, is at the heart of a knotty problem that statisticians and machine-learning researchers call “overfitting.”