Kindle Notes & Highlights
Read between April 23 - September 15, 2018
Computers themselves do something like this: they wait until some fixed interval and check everything, instead of context-switching to handle separate, uncoordinated interrupts from their various subcomponents.
In academia, holding office hours is a way of coalescing interruptions from students. And in the private sector, interrupt coalescing offers a redemptive view of one of the most maligned office rituals: the weekly meeting.
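A minimal sketch of the arithmetic behind coalescing, in Python, assuming made-up costs (200 notifications in an 8-hour day, 2 minutes of refocusing per interruption, 30 seconds per item):

    N_INTERRUPTS = 200   # hypothetical notifications arriving over an 8-hour day
    SWITCH_COST = 2.0    # minutes lost refocusing each time we break off to handle things
    HANDLE_COST = 0.5    # minutes to deal with one notification

    # Handle each interruption the moment it arrives: pay the refocusing cost every time.
    immediate = N_INTERRUPTS * (SWITCH_COST + HANDLE_COST)    # 500 minutes

    # Coalesce: check once an hour and clear whatever has queued up since the last check.
    coalesced = 8 * SWITCH_COST + N_INTERRUPTS * HANDLE_COST  # 116 minutes

    print(immediate, coalesced)

The work itself is the same either way; only the number of times we pay the cost of switching contexts changes.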
“This is what computer scientists call batch processing—the alternative is swapping in and out. I don’t swap in and out.”
“Email is a wonderful thing for people whose role in life is to be on top of things. But not for me; my role is to be on the bottom of things. What I do takes long hours of studying and uninterruptible concentration.” Knuth reviews all his postal mail every three months, and all his faxes every six.
Our days are full of “small data.” In fact, like Gott standing at the Berlin Wall, we often have to make an inference from the smallest amount of data we could possibly have: a single observation.
If we be, therefore, engaged by arguments to put trust in past experience, and make it the standard of our future judgement, these arguments must be probable only.
The question of making predictions from small data weighed heavily on the mind of the Reverend Thomas Bayes, a Presbyterian minister in the charming spa town of Tunbridge Wells, England.
If we buy ten tickets for a new and unfamiliar raffle, Bayes imagined, and five of them win prizes, then it seems relatively easy to estimate the raffle’s chances of a win: 5/10, or 50%. But what if instead we buy a single ticket and it wins a prize? Do we really imagine the probability of winning to be 1/1, or 100%? That seems too optimistic. Is it? And if so, by how much? What should we actually guess?
Bayes’s critical insight was that trying to use the winning and losing tickets we see to figure out the overall ticket pool that they came from is essentially reasoning backward. And to do that, he argued, we need to first reason forward from hypotheticals. In other words, we need to first determine how probable it is that we would have drawn the tickets we did if various scenarios were true. This probability—known to modern statisticians as the “likelihood”—gives us the information we need to solve the problem.
This is the crux of Bayes’s argument. Reasoning forward from hypothetical pasts lays the foundation for us to then work backward to the most probable one.
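A small sketch of that forward-then-backward reasoning in Python, using three hypothetical win rates for the raffle (the specific scenarios are made up just to keep the arithmetic visible):

    # Hypothetical scenarios for the raffle's true proportion of winning tickets.
    hypotheses = [0.1, 0.5, 0.9]
    prior = {h: 1 / len(hypotheses) for h in hypotheses}  # no reason yet to favor any scenario

    # Reason forward: how likely is our observation (one ticket bought, one winner) under each?
    likelihood = {h: h for h in hypotheses}  # the chance a single ticket wins is just h

    # Reason backward: weigh each scenario by prior * likelihood, then rescale to sum to one.
    weighted = {h: prior[h] * likelihood[h] for h in hypotheses}
    total = sum(weighted.values())
    posterior = {h: round(w / total, 3) for h, w in weighted.items()}
    print(posterior)  # {0.1: 0.067, 0.5: 0.333, 0.9: 0.6}

The mostly-winning raffle is now the best explanation of what we saw, but the other scenarios keep some probability, which is why the answer is not simply 100%.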
As Laplace showed, after drawing a winning ticket on our first try we should expect that the proportion of winning tickets in the whole pool is exactly 2/3.
This incredibly simple scheme for estimating probabilities is known as Laplace’s Law,
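The law itself is a one-line formula: having seen w wins in n attempts, estimate the chance of a win as (w + 1) / (n + 2). A quick check in Python:

    def laplace(wins, attempts):
        """Laplace's Law: estimated chance of success after `wins` out of `attempts`."""
        return (wins + 1) / (attempts + 2)

    print(laplace(1, 1))   # 0.666... -- one ticket bought, one winner: expect 2/3
    print(laplace(5, 10))  # 0.5      -- five winners in ten tickets: expect 50%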
Laplace also wrote the Philosophical Essay on Probabilities, arguably the first book about probability for a general audience and still one of the best, laying out his theory and considering its applications to law, the sciences, and everyday life.
This result is now known—though the real heavy lifting was done by Laplace—as Bayes’s Rule. And it gives a remarkably straightforward solution to the problem of how to combine preexisting beliefs with observed evidence: multiply their probabilities together.
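In symbols, a sketch of the rule being described (not the book’s own notation): P(hypothesis | data) is proportional to P(data | hypothesis) × P(hypothesis), i.e. the likelihood times the prior, with the results then rescaled so that the probabilities of all the competing hypotheses sum to one.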
(You can’t multiply the two probabilities together when you don’t have one of them.)
And Bayes’s Rule always needs some prior from you.
The fact that Bayes’s Rule is dependent on the use of priors has at certain points in history been considered controversial, biased, even unscientific. But in reality, it is quite rare to go into a situation so totally unfamiliar that our mind is effectively a blank slate—a point we’ll return to momentarily.
And it turns out that the Copernican Principle is exactly what results from applying Bayes’s Rule using what is known as an uninformative prior.
The richer the prior information we bring to Bayes’s Rule, the more useful the predictions we can get out of it.
This kind of pattern typifies what are called “power-law distributions.”
The power-law distribution characterizes a host of phenomena in everyday life that have the same basic quality as town populations: most things below the mean, and a few enormous ones above it.
In fact, money in general is a domain full of power laws. Power-law distributions characterize both people’s wealth and people’s incomes.
It’s often lamented that “the rich get richer,” and indeed the process of “preferential attachment” is one of the surest ways to produce a power-law distribution.
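A minimal rich-get-richer simulation in Python, along the lines of a Yule–Simon preferential-attachment process; the newcomer probability and round count are arbitrary choices for illustration:

    import random

    random.seed(1)
    NEWCOMER_PROB = 0.1   # chance each new unit of wealth starts a brand-new fortune (made up)
    wealth = [1.0]

    for _ in range(50_000):
        if random.random() < NEWCOMER_PROB:
            wealth.append(1.0)   # a new player enters the economy with one unit
        else:
            # otherwise the unit goes to someone with probability proportional to their wealth
            i = random.choices(range(len(wealth)), weights=wealth)[0]
            wealth[i] += 1.0

    wealth.sort(reverse=True)
    mean = sum(wealth) / len(wealth)
    print(f"{len(wealth)} people, mean wealth {mean:.1f}")
    print(f"fraction below the mean: {sum(w < mean for w in wealth) / len(wealth):.0%}")
    print(f"share held by the top 1%: {sum(wealth[:len(wealth) // 100]) / sum(wealth):.0%}")

Most fortunes end up below the mean while a handful tower above it, which is exactly the shape of a power-law tail.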
Examining the Copernican Principle, we saw that when Bayes’s Rule is given an uninformative prior, it always predicts that the total life span of an object will be exactly double its current age.
Bayes’s Rule indicates that the appropriate prediction strategy is a Multiplicative Rule: multiply the quantity observed so far by some constant factor.
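A sketch of the Multiplicative Rule in Python. With an uninformative prior the constant factor is 2 (the Copernican doubling); a richer power-law prior would give some other constant (the 1.4 below is purely a placeholder, not a figure from the book):

    def multiplicative_rule(observed_so_far, factor=2.0):
        """Power-law prior: predict the total by scaling up whatever has been observed so far."""
        return observed_so_far * factor

    print(multiplicative_rule(8))              # 16.0 -- the Copernican Principle's doubling
    print(multiplicative_rule(8, factor=1.4))  # 11.2 -- same logic, different (made-up) constant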
Instead of a multiplicative rule, we get an Average Rule: use the distribution’s “natural” average as your guide.
Something normally distributed that’s gone on seemingly too long is bound to end shortly; but the longer something in a power-law distribution has gone on, the longer you can expect it to keep going.
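A numeric sketch of the Average Rule in Python, assuming a roughly normal prior over human life spans with a mean of 80 and a standard deviation of 10 (illustrative values, not the book’s figures):

    import random

    random.seed(2)
    # A rough normal prior over life spans: mean 80, standard deviation 10 (made-up numbers).
    samples = [random.gauss(80, 10) for _ in range(200_000)]

    def average_rule(age_so_far):
        """Normal prior: predict the average of all life spans at least as long as what we've seen."""
        survivors = [s for s in samples if s >= age_so_far]
        return sum(survivors) / len(survivors)

    print(round(average_rule(6)))    # ~80 -- a child is predicted to live to roughly the average
    print(round(average_rule(90)))   # ~95 -- past the average, predict only a little while longer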
Note: the assumed prior distribution gives a hint about what prediction to make. Is it normal, or scale-free (power law)?
The Danish mathematician Agner Krarup Erlang, who studied such phenomena, formalized the spread of intervals between independent events into the function that now carries his name: the Erlang distribution.
Since then, the Erlang distribution has also been used by urban planners and architects to model car and pedestrian traffic, and by networking engineers designing infrastructure for the Internet.
The intervals between independent events thus fall on an Erlang curve. Radioactive decay is one example, which means that the Erlang distribution perfectly models when to expect the next ticks of a Geiger counter.
The Erlang distribution gives us a third kind of prediction rule, the Additive Rule: always predict that things will go on just a constant amount longer.
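A sketch of the Additive Rule in Python, using the memoryless special case of the Erlang family (shape 1, i.e. an exponential distribution) and a made-up ten-minute average gap between events:

    import random

    random.seed(3)
    MEAN_WAIT = 10.0   # hypothetical average gap between events, in minutes
    waits = [random.expovariate(1 / MEAN_WAIT) for _ in range(200_000)]

    def expected_total_wait(waited_so_far):
        """Predicted total wait, given that we have already waited this long."""
        still_waiting = [w for w in waits if w >= waited_so_far]
        return sum(still_waiting) / len(still_waiting)

    # Additive Rule: however long we have waited, the event is about MEAN_WAIT further away.
    for t in (0, 5, 20):
        print(t, round(expected_total_wait(t) - t, 1))   # ~10.0 every time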
In fact, his prediction is entirely correct. Indeed, distributions that yield the same prediction, no matter their history or current state, are called “memoryless.”
These three very different patterns of optimal prediction—the Multiplicative, Average, and Additive Rules—all result directly from applying Bayes’s Rule to the power-law, normal, and Erlang distributions, respectively.
In a power-law distribution, the longer something has gone on, the longer we expect it to continue going on. So a power-law event is more surprising the longer we’ve been waiting for it—and maximally surprising right before it happens.
In a normal distribution, events are surprising when they’re early—since we expected them to reach the average—but not when they’re late. Indeed, by that point they seem overdue to happen, so the longer we wait, the more we expect them.
And in an Erlang distribution, events by definition are never any more or less surprising no matter when they occur.
“Know when to walk away / Know when to run”—but for a memoryless distribution, there is no right time to quit. This may in part explain these games’ addictiveness.
Knowing what distribution you’re up against can make all the difference.
The three prediction rules—Multiplicative, Average, and Additive—are applicable in a wide range of everyday situations.
The reason we can often make good predictions from a small number of observations—or just a single one—is that our priors are so rich.
Over the past decade, approaches like these have enabled cognitive scientists to identify people’s prior distributions across a broad swath of domains, from vision to language.
People simply didn’t have enough everyday exposure to have an intuitive feel for the range of those values, so their predictions, of course, faltered.
Good predictions require good priors.
If the amount of time it takes for adults to come back is governed by a power-law distribution—with long absences suggesting even longer waits lie ahead—then cutting one’s losses at some point can make perfect sense.
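A quick numeric check of that intuition in Python, using a Pareto (power-law) prior over waiting times; the shape parameter is made up, and any shape greater than 1 shows the same pattern:

    import random

    random.seed(4)
    ALPHA = 2.5   # made-up power-law shape parameter
    waits = [random.paretovariate(ALPHA) for _ in range(500_000)]

    def expected_remaining(waited_so_far):
        """Expected additional wait, given that we have already waited this long."""
        survivors = [w for w in waits if w >= waited_so_far]
        return sum(survivors) / len(survivors) - waited_so_far

    # For a Pareto prior the analytic answer is waited_so_far / (ALPHA - 1): the expected
    # additional wait grows in proportion to how long we have already been waiting.
    for t in (1, 3, 10):
        print(t, round(expected_remaining(t), 1), round(t / (ALPHA - 1), 1))

The longer the absence has already lasted, the longer the wait still to come, which is why cutting one’s losses can be the rational move.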