Kindle Notes & Highlights
Read between November 19, 2023 and January 11, 2024
legendary programmer Donald Knuth. “I do one thing at a time,” he says. “This is what computer scientists call batch processing—the alternative is swapping in and out. I don’t swap in and out.”
If we be, therefore, engaged by arguments to put trust in past experience, and make it the standard of our future judgement, these arguments must be probable only. —DAVID HUME
Laplace published an ambitious paper called “Treatise on the Probability of the Causes of Events.” In it, Laplace finally solved the problem of how to make inferences backward from observed effects to their probable causes.
In fact, for any possible drawing of w winning tickets in n attempts, the expectation is simply the number of wins plus one, divided by the number of attempts plus two: (w+1)⁄(n+2).
This incredibly simple scheme for estimating probabilities is known as Laplace’s Law, and it is easy to apply in any situation where you need to assess the chances of an event based on its history.
He also wrote the Philosophical Essay on Probabilities, arguably the first book about probability for a general audience and still one of the best, laying out his theory and considering its applications to law, the sciences, and everyday life.
Want to calculate the chance your bus is late? The chance your softball team will win? Count the number of times it has happened in the past plus one, then divide by the number of opportunities plus two.
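The formula from the highlights above—wins plus one over attempts plus two—is simple enough to sketch in a few lines. The bus example is made up to match the book's framing:

```python
def laplace_law(wins: int, attempts: int) -> float:
    """Laplace's Law: estimate an event's probability from its history
    as (wins + 1) / (attempts + 2)."""
    return (wins + 1) / (attempts + 2)

# With no history at all, the estimate is an even 50/50:
print(laplace_law(0, 0))   # 0.5

# The bus was late on 2 of the last 8 days:
print(laplace_law(2, 8))   # 0.3
```

Note that unlike a raw frequency (which would give 0/0 with no history, and 0 after a single loss), Laplace's Law always yields a sensible estimate strictly between 0 and 1.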
The mathematical formula that describes this relationship, tying together our previously held ideas and the evidence before our eyes, has come to be known—ironically, as the real heavy lifting was done by Laplace—as Bayes’s Rule. And it gives a remarkably straightforward solution to the problem of how to combine preexisting beliefs with observed evidence: multiply their probabilities together.
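The "multiply their probabilities together" step can be made concrete with a toy example. The scenario and numbers here are invented for illustration: a coin that is either fair or biased toward heads, with equal prior belief in each:

```python
# Hypothetical setup: the coin is either fair or biased (90% heads),
# and we start with equal prior belief in each hypothesis.
priors = {"fair": 0.5, "biased": 0.5}
likelihood_heads = {"fair": 0.5, "biased": 0.9}

# Observe three heads in a row: multiply each prior by the likelihood
# of that evidence under the corresponding hypothesis ...
unnormalized = {h: priors[h] * likelihood_heads[h] ** 3 for h in priors}

# ... then normalize so the posterior beliefs sum to 1.
total = sum(unnormalized.values())
posterior = {h: p / total for h, p in unnormalized.items()}
print(posterior)  # belief in "biased" jumps well above 0.8
```

The whole of Bayes's Rule is in those two steps: prior times likelihood, then a normalization so the beliefs remain probabilities.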
the Copernican Principle, results in a simple algorithm that can be used to make predictions about all sorts of topics.
To understand why the Copernican Principle works, and why it sometimes doesn’t, we need to return to Bayes. Because despite its apparent simplicity, the Copernican Principle is really an instance of Bayes’s Rule.
assume what’s called the “uniform prior,” which considers every proportion of winning tickets to be equally likely.*
When Bayes’s Rule combines all these probabilities—the more-probable short time spans pushing down the average forecast, the less-probable yet still possible long ones pushing it up—the Copernican Principle emerges: if we want to predict how long something will last, and have no other knowledge about it whatsoever, the best guess we can make is that it will continue just as long as it’s gone on so far.
Recognizing that the Copernican Principle is just Bayes’s Rule with an uninformative prior answers a lot of questions about its validity. The Copernican Principle seems reasonable exactly in those situations where we know nothing at all—not even sure what timescale is appropriate. And it feels completely wrong in those cases where we do know something about the subject matter.
there are two types of things in the world: things that tend toward (or cluster around) some kind of “natural” value, and things that don’t.
“Gaussian” distribution, after the German mathematician Carl Friedrich Gauss, and informally called the “bell curve”
Many other things in the natural world are normally distributed as well, from human height, weight, and blood pressure to the noontime temperature in a city and the diameter of fruits in an orchard.
if you were to make a graph of the number of towns by population, you wouldn’t see anything remotely like a bell curve. There would be way more towns smaller than 8,226 than larger. At the same time, the larger ones would be way bigger than the average. This kind of pattern typifies what are called “power-law distributions.” These are also known as “scale-free distributions” because they characterize quantities that can plausibly range over many scales:
The power-law distribution characterizes a host of phenomena in everyday life that have the same basic quality as town populations: most things below the mean, and a few enormous ones above it.
money in general is a domain full of power laws. Power-law distributions characterize both people’s wealth and people’s incomes.
the process of “preferential attachment” is one of the surest ways to produce a power-law distribution.
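A minimal rich-get-richer simulation shows why preferential attachment produces this kind of skew. The setup (10 "towns," 5,000 arrivals, each arrival joining a town with probability proportional to its current size) is an arbitrary sketch, not taken from the book:

```python
import random

random.seed(42)  # fixed seed so the run is deterministic

# Start with 10 "towns" of size 1; each new arrival joins a town
# with probability proportional to its current size.
sizes = [1] * 10
for _ in range(5000):
    # random.choices picks an index weighted by the current sizes
    idx = random.choices(range(len(sizes)), weights=sizes)[0]
    sizes[idx] += 1

sizes.sort(reverse=True)
print(sizes)  # a few huge towns and many small ones: a heavy-tailed spread
```

Early luck compounds: a town that happens to grow first attracts arrivals faster, which makes it grow faster still—exactly the dynamic behind power-law-shaped outcomes.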
Bayes’s Rule tells us that when it comes to making predictions based on limited evidence, few things are as important as having good priors—that is, a sense of the distribution from which we expect that evidence to have come.
Good predictions thus begin with having good instincts about when we’re dealing with a normal distribution and when with a power-law distribution.
the uninformative prior, with its wildly varying possible scales—the wall that might last for months or for millennia—is a power-law distribution.
for any power-law distribution, Bayes’s Rule indicates that the appropriate prediction strategy is a Multiplicative Rule: multiply the quantity observed so far by some constant factor. For an uninformative prior, that constant factor happens to be 2, hence the Copernican prediction; in other power-law cases, the multiplier will depend on the exact distribution you’re working with.
apply Bayes’s Rule with a normal distribution as a prior, on the other hand, we obtain a very different kind of guidance. Instead of a multiplicative rule, we get an Average Rule: use the distribution’s “natural” average—its single, specific scale—as your guide.
there’s actually a third category of things in life: those that are neither more nor less likely to end just because they’ve gone on for a while. Sometimes things are simply … invariant.
the spread of intervals between independent events into the function that now carries his name: the Erlang distribution. The shape of this curve differs from both the normal and the power-law:
The Erlang distribution gives us a third kind of prediction rule, the Additive Rule: always predict that things will go on just a constant amount longer.
distributions that yield the same prediction, no matter their history or current state, are known to statisticians as “memoryless.”
different patterns of optimal prediction—the Multiplicative, Average, and Additive Rules—all result directly from applying Bayes’s Rule to the power-law, normal, and Erlang distributions, respectively.
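The three rules summarized above can be sketched as three one-line predictors. The factor of 2 is the uninformative-prior case named in the highlights; the 74-year average and 8-year constant in the usage example are made-up numbers:

```python
def multiplicative_rule(observed, factor=2.0):
    """Power-law prior: multiply what you've seen so far by a constant factor.
    With an uninformative prior the factor is 2 (the Copernican prediction)."""
    return factor * observed

def average_rule(observed, natural_average):
    """Normal prior: predict the distribution's 'natural' average
    (or just past it, once the observation has already exceeded it)."""
    return max(natural_average, observed)

def additive_rule(observed, constant):
    """Erlang / memoryless prior: predict a constant amount more,
    no matter how long things have gone on."""
    return observed + constant

# A wall that has stood for 10 years, under each assumption:
print(multiplicative_rule(10))   # 20.0 -- expect it to last as long again
print(average_rule(10, 74))      # 74   -- expect it to reach the average
print(additive_rule(10, 8))      # 18   -- always 8 more, regardless of history
```

The same observation yields three very different forecasts; everything hinges on which prior distribution the quantity is assumed to follow.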
In a power-law distribution, the longer something has gone on, the longer we expect it to continue going on.
In a normal distribution, events are surprising when they’re early—since we expected them to reach the average—but not when they’re late.
in an Erlang distribution, events by definition are never any more or less surprising no matter when they occur. Any state of affairs is always equally likely to end regardless of how long it’s lasted.
Intuitively, people made different types of predictions for quantities that followed different distributions—power-law, normal, and Erlang—in the real world.
Small data is big data in disguise.
In cases where we don’t have good priors, our predictions aren’t good.
What we project about the future reveals a lot—about the world we live in, and about our own past.
the ability to resist temptation may be, at least in part, a matter of expectations rather than willpower.
children who had waited for two treats grew into young adults who were more successful than the others, even measured by quantitative metrics like their SAT scores.
Failing the marshmallow test—and being less successful in later life—may not be about lacking willpower. It could be a result of believing that adults are not dependable: that they can’t be trusted to keep their word, that they disappear for intervals of arbitrary length. Learning self-control is important, but it’s equally important to grow up in an environment where adults are consistently present and trustworthy.
If you want to be a good intuitive Bayesian—if you want to naturally make good predictions, without having to think about what kind of prediction rule is appropriate—you need to protect your priors. Counterintuitively, that might mean turning off the news.
The question of how hard to think, and how many factors to consider, is at the heart of a knotty problem that statisticians and machine-learning researchers call “overfitting.”
there’s a wisdom to deliberately thinking less.
Every decision is a kind of prediction: about how much you’ll like something you haven’t tried yet, about where a certain trend is heading, about how the road less traveled (or more so) is likely to pan out. And every prediction, crucially, involves thinking about two distinct things: what you know and what you don’t.
In other words, overfitting poses a danger any time we’re dealing with noise or mismeasurement—and we almost always are.
Cross-Validation means assessing not only how well a model fits the data it’s given, but how well it generalizes to data it hasn’t seen.
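A minimal leave-one-out sketch of that idea: compare a simple linear fit against a model that merely memorizes the training points (here, a nearest-neighbor echo stands in for the overfit model). The data is fabricated to lie roughly on a line:

```python
# Toy data roughly on the line y = 2x, with small fixed "noise".
xs = list(range(10))
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.05, 0.2, -0.15, 0.1, -0.1]
ys = [2 * x + e for x, e in zip(xs, noise)]

def linear_fit(xs, ys):
    """Closed-form simple linear regression (least squares)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    intercept = ybar - slope * xbar
    return lambda x: slope * x + intercept

def nearest_neighbor_fit(xs, ys):
    """An 'overfit' model: echo the training point whose x is closest."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def loo_mse(fit, xs, ys):
    """Leave-one-out cross-validation: train on all points but one,
    test on the held-out point, and average the squared errors."""
    errs = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errs.append((model(xs[i]) - ys[i]) ** 2)
    return sum(errs) / len(errs)

# The simple line generalizes far better than the memorizer:
print(loo_mse(linear_fit, xs, ys) < loo_mse(nearest_neighbor_fit, xs, ys))  # True
```

The memorizer fits its training data perfectly but stumbles on every held-out point—which is exactly the warning sign cross-validation is designed to surface.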
If a school’s standardized scores rose while its “nonstandardized” performance moved in the opposite direction, administrators would have a clear warning sign that “teaching to the test” had set in, and the pupils’ skills were beginning to overfit the mechanics of the test itself.
Computer scientists refer to this principle—using constraints that penalize models for their complexity—as Regularization.
One algorithm, discovered in 1996 by biostatistician Robert Tibshirani, is called the Lasso and uses as its penalty the total weight of the different factors in the model.*
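A one-feature sketch of the Lasso idea—penalizing the squared error by the absolute size of the weight. A brute-force grid search stands in for the real coordinate-descent optimizer, and the data and penalty values are made up:

```python
# Toy data lying exactly on y = 3x.
xs = [1, 2, 3, 4, 5]
ys = [3 * x for x in xs]

def lasso_weight(lam, step=0.01):
    """Minimize squared error + lam * |w| over a grid of candidate weights.
    (Grid search stands in for the real Lasso optimizer.)"""
    candidates = [i * step for i in range(-500, 501)]
    def objective(w):
        rss = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
        return rss + lam * abs(w)
    return min(candidates, key=objective)

print(round(lasso_weight(0), 2))     # 3.0 -- no penalty recovers the true slope
print(round(lasso_weight(50), 2))    # shrunk below 3
print(round(lasso_weight(1000), 2))  # 0.0 -- a heavy penalty zeroes the weight out
```

This is the signature behavior of the Lasso's absolute-value penalty: as the penalty grows, weights are not just shrunk but driven all the way to zero, pruning factors out of the model entirely.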