Algorithms to Live By: The Computer Science of Human Decisions
Read between November 19, 2023 - January 11, 2024
28%
legendary programmer Donald Knuth. “I do one thing at a time,” he says. “This is what computer scientists call batch processing—the alternative is swapping in and out. I don’t swap in and out.”
29%
If we be, therefore, engaged by arguments to put trust in past experience, and make it the standard of our future judgement, these arguments must be probable only. —DAVID HUME
29%
Laplace published an ambitious paper called “Treatise on the Probability of the Causes of Events.” In it, Laplace finally solved the problem of how to make inferences backward from observed effects to their probable causes.
29%
In fact, for any possible drawing of w winning tickets in n attempts, the expectation is simply the number of wins plus one, divided by the number of attempts plus two: (w + 1) / (n + 2).
29%
This incredibly simple scheme for estimating probabilities is known as Laplace’s Law, and it is easy to apply in any situation where you need to assess the chances of an event based on its history.
30%
He also wrote the Philosophical Essay on Probabilities, arguably the first book about probability for a general audience and still one of the best, laying out his theory and considering its applications to law, the sciences, and everyday life.
30%
Want to calculate the chance your bus is late? The chance your softball team will win? Count the number of times it has happened in the past plus one, then divide by the number of opportunities plus two.
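A minimal Python sketch of Laplace's Law; the bus counts below are invented for illustration:

```python
def laplace_rule(wins: int, attempts: int) -> float:
    """Laplace's Law: estimate an event's probability from its history.

    With w wins in n attempts, the estimate is (w + 1) / (n + 2).
    """
    return (wins + 1) / (attempts + 2)

# Hypothetical history: the bus has been late 2 times in 8 rides.
print(laplace_rule(2, 8))   # 0.3 -> estimate a 30% chance it is late today
# With no history at all, the rule sensibly returns an even 50/50:
print(laplace_rule(0, 0))   # 0.5
```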
30%
The mathematical formula that describes this relationship, tying together our previously held ideas and the evidence before our eyes, has come to be known—ironically, as the real heavy lifting was done by Laplace—as Bayes’s Rule. And it gives a remarkably straightforward solution to the problem of how to combine preexisting beliefs with observed evidence: multiply their probabilities together.
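That multiplication step, sketched on a made-up example (the raffle-drum hypotheses and numbers are illustrative, not from the book):

```python
def bayes_update(priors, likelihoods):
    """Bayes's Rule over a handful of hypotheses: multiply each prior
    belief by the probability of the evidence under that hypothesis,
    then renormalize so the posteriors sum to 1."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypotheses: a raffle drum holds all winning, half winning, or no
# winning tickets, believed equally likely beforehand. We draw a winner.
priors      = [1/3, 1/3, 1/3]
likelihoods = [1.0, 0.5, 0.0]   # P(winning draw | each hypothesis)
print(bayes_update(priors, likelihoods))   # [0.667, 0.333, 0.0]
```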
30%
the Copernican Principle, results in a simple algorithm that can be used to make predictions about all sorts of topics.
30%
To understand why the Copernican Principle works, and why it sometimes doesn’t, we need to return to Bayes. Because despite its apparent simplicity, the Copernican Principle is really an instance of Bayes’s Rule.
31%
assume what’s called the “uniform prior,” which considers every proportion of winning tickets to be equally likely.*
31%
When Bayes’s Rule combines all these probabilities—the more-probable short time spans pushing down the average forecast, the less-probable yet still possible long ones pushing it up—the Copernican Principle emerges: if we want to predict how long something will last, and have no other knowledge about it whatsoever, the best guess we can make is that it will continue just as long as it’s gone on so far.
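A numeric sketch of that emergence, under the standard assumptions for this derivation: a scale-free prior p(T) ∝ 1/T over total durations T, and an equal chance of encountering the thing at any moment of its life, so p(t | T) = 1/T for t ≤ T. The numbers are arbitrary.

```python
import numpy as np

t_now = 8.0                                # observed age so far (any units)
T = np.linspace(t_now, 10_000.0, 500_000)  # candidate total durations
dx = T[1] - T[0]

posterior = (1.0 / T) * (1.0 / T)          # likelihood x prior, unnormalized
posterior /= posterior.sum() * dx          # normalize to integrate to 1

cdf = np.cumsum(posterior) * dx
print(T[np.searchsorted(cdf, 0.5)])        # ~16.0: the posterior median
# The best guess for the total duration is about twice the observed age,
# i.e. it will continue roughly as long as it has gone on so far.
```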
31%
Recognizing that the Copernican Principle is just Bayes’s Rule with an uninformative prior answers a lot of questions about its validity. The Copernican Principle seems reasonable exactly in those situations where we know nothing at all—
31%
not even sure what timescale is appropriate. And it feels completely wrong in those cases where we do know something about the subject matter.
31%
there are two types of things in the world: things that tend toward (or cluster around) some kind of “natural” value, and things that don’t.
31%
“Gaussian” distribution, after the German mathematician Carl Friedrich Gauss, and informally called the “bell curve”
31%
Many other things in the natural world are normally distributed as well, from human height, weight, and blood pressure to the noontime temperature in a city and the diameter of fruits in an orchard.
31%
if you were to make a graph of the number of towns by population, you wouldn’t see anything remotely like a bell curve. There would be way more towns smaller than 8,226 than larger. At the same time, the larger ones would be way bigger than the average. This kind of pattern typifies what are called “power-law distributions.” These are also known as “scale-free distributions” because they characterize quantities that can plausibly range over many scales:
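A quick illustration of the contrast, with invented parameters: draw samples from a bell curve and from a power law, then look at how each splits around its own mean.

```python
import numpy as np

rng = np.random.default_rng(0)

heights = rng.normal(loc=170, scale=10, size=100_000)    # bell curve, cm
towns   = (rng.pareto(a=1.2, size=100_000) + 1) * 1_000  # power law, people

for name, x in [("normal", heights), ("power-law", towns)]:
    share_below = (x < x.mean()).mean()
    print(f"{name:9s} mean={x.mean():10.0f} median={np.median(x):8.0f} "
          f"below mean={share_below:.0%}")
# The normal sample splits roughly 50/50 around its mean; the power-law
# sample has most values below the mean and a few enormous ones above it.
```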
31%
The power-law distribution characterizes a host of phenomena in everyday life that have the same basic quality as town populations: most things below the mean, and a few enormous ones above it.
31%
money in general is a domain full of power laws. Power-law distributions characterize both people’s wealth and people’s incomes.
31%
the process of “preferential attachment” is one of the surest ways to produce a power-law distribution.
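A toy simulation of that rich-get-richer process (illustrative, not the book's code). Keeping a flat list of link endpoints makes a uniform draw from it exactly a degree-weighted, i.e. preferential, draw:

```python
import random
from collections import Counter

random.seed(0)

endpoints = [0, 1]                     # node ids, one entry per link end
for new_node in range(2, 100_000):
    target = random.choice(endpoints)  # chosen in proportion to its degree
    endpoints += [new_node, target]    # record the new link's two ends

degrees = sorted(Counter(endpoints).values(), reverse=True)
print(degrees[:5], degrees[len(degrees) // 2])
# A few early hubs accumulate hundreds of links while the median node
# keeps one or two: the signature shape of a power-law distribution.
```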
31%
Bayes’s Rule tells us that when it comes to making predictions based on limited evidence, few things are as important as having good priors—that is, a sense of the distribution from which we expect that evidence to have come.
31%
Good predictions thus begin with having good instincts about when we’re dealing with a normal distribution and when with a power-law distribution.
31%
the uninformative prior, with its wildly varying possible scales—the wall that might last for months or for millennia—is a power-law distribution.
31%
for any power-law distribution, Bayes’s Rule indicates that the appropriate prediction strategy is a Multiplicative Rule: multiply the quantity observed so far by some constant factor. For an uninformative prior, that constant factor happens to be 2, hence the Copernican prediction; in other power-law cases, the multiplier will depend on the exact distribution you’re working with.
31%
apply Bayes’s Rule with a normal distribution as a prior, on the other hand, we obtain a very different kind of guidance. Instead of a multiplicative rule, we get an Average Rule: use the distribution’s “natural” average—its single, specific scale—as your guide.
31%
there’s actually a third category of things in life: those that are neither more nor less likely to end just because they’ve gone on for a while. Sometimes things are simply … invariant.
32%
the spread of intervals between independent events into the function that now carries his name: the Erlang distribution. The shape of this curve differs from both the normal and the power-law:
32%
The Erlang distribution gives us a third kind of prediction rule, the Additive Rule: always predict that things will go on just a constant amount longer.
32%
distributions that yield the same prediction, no matter their history or current state, are known to statisticians as “memoryless.”
32%
different patterns of optimal prediction—the Multiplicative, Average, and Additive Rules—all result directly from applying Bayes’s Rule to the power-law, normal, and Erlang distributions, respectively.
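The three rules, sketched as plain functions. The constants below are placeholders: the right multiplier, natural average, and increment each come from the particular distribution you believe you are drawing from.

```python
def multiplicative_rule(observed, factor=2.0):
    # Power-law prior: predict a constant multiple of what you've seen.
    # factor=2 is the uninformative-prior (Copernican) special case.
    return observed * factor

def average_rule(observed, natural_average):
    # Normal prior: predict the distribution's natural scale. (Simplified:
    # once you've passed the average, expect the end to come shortly.)
    return max(observed, natural_average)

def additive_rule(observed, increment):
    # Erlang (memoryless) prior: always predict a constant amount more.
    return observed + increment

# A 6-year-old's lifespan under each rule (numbers purely illustrative):
print(average_rule(6, natural_average=76))   # 76: lifespans are normal-ish
print(multiplicative_rule(6))                # 12: if they were power-law
print(additive_rule(6, increment=8))         # 14: if they were memoryless
```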
32%
In a power-law distribution, the longer something has gone on, the longer we expect it to continue going on.
32%
In a normal distribution, events are surprising when they’re early—since we expected them to reach the average—but not when they’re late.
32%
in an Erlang distribution, events by definition are never any more or less surprising no matter when they occur. Any state of affairs is always equally likely to end regardless of how long it’s lasted.
32%
Intuitively, people made different types of predictions for quantities that followed different distributions—power-law, normal, and Erlang—in the real world.
32%
Small data is big data in disguise.
32%
In cases where we don’t have good priors, our predictions aren’t good.
32%
What we project about the future reveals a lot—about the world we live in, and about our own past.
33%
the ability to resist temptation may be, at least in part, a matter of expectations rather than willpower.
33%
children who had waited for two treats grew into young adults who were more successful than the others, even measured by quantitative metrics like their SAT scores.
33%
Failing the marshmallow test—and being less successful in later life—may not be about lacking willpower. It could be a result of believing that adults are not dependable: that they can’t be trusted to keep their word, that they disappear for intervals of arbitrary length. Learning self-control is important, but it’s equally important to grow up in an environment where adults are consistently present and trustworthy.
33%
If you want to be a good intuitive Bayesian—if you want to naturally make good predictions, without having to think about what kind of prediction rule is appropriate—you need to protect your priors. Counterintuitively, that might mean turning off the news.
34%
The question of how hard to think, and how many factors to consider, is at the heart of a knotty problem that statisticians and machine-learning researchers call “overfitting.”
34%
there’s a wisdom to deliberately thinking less.
34%
Every decision is a kind of prediction: about how much you’ll like something you haven’t tried yet, about where a certain trend is heading, about how the road less traveled (or more so) is likely to pan out. And every prediction, crucially, involves thinking about two distinct things: what you know and what you don’t.
34%
In other words, overfitting poses a danger any time we’re dealing with noise or mismeasurement—and we almost always are.
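A small demonstration of that danger, on invented data: the true relationship is a straight line, but the observations carry noise, and a nine-parameter polynomial dutifully fits the noise.

```python
import numpy as np

rng = np.random.default_rng(1)

x_train = np.linspace(0, 1, 9)
y_train = 2 * x_train + rng.normal(scale=0.2, size=9)  # line plus noise
x_test  = np.linspace(0, 1, 200)
y_test  = 2 * x_test                                   # noise-free truth

for degree in (1, 8):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err  = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train error {train_err:.4f}, "
          f"test error {test_err:.4f}")
# The degree-8 fit is near-perfect on the nine points it saw and far
# worse off them: it has modeled the mismeasurement, not the signal.
```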
35%
Cross-Validation means assessing not only how well a model fits the data it’s given, but how well it generalizes to data it hasn’t seen.
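A hand-rolled k-fold sketch of that idea, on the same kind of noisy-line toy data (self-contained and illustrative; real projects would typically lean on a library for the splitting):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 30)
y = 2 * x + rng.normal(scale=0.2, size=30)   # noisy straight line

def cv_error(x, y, degree, k=5):
    """Average held-out error: fit on k-1 folds, score on the left-out fold."""
    errors = []
    for held_out in np.array_split(rng.permutation(len(x)), k):
        train = np.setdiff1d(np.arange(len(x)), held_out)
        coeffs = np.polyfit(x[train], y[train], degree)
        preds = np.polyval(coeffs, x[held_out])
        errors.append(np.mean((preds - y[held_out]) ** 2))
    return float(np.mean(errors))

for degree in (1, 8):
    print(f"degree {degree}: cross-validated error {cv_error(x, y, degree):.4f}")
# The straight line generalizes to data it hasn't seen; the flexible
# polynomial's held-out error balloons.
```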
35%
If a school’s standardized scores rose while its “nonstandardized” performance moved in the opposite direction, administrators would have a clear warning sign that “teaching to the test” had set in, and the pupils’ skills were beginning to overfit the mechanics of the test itself.
36%
Computer scientists refer to this principle—using constraints that penalize models for their complexity—as Regularization.
36%
One algorithm, discovered in 1996 by biostatistician Robert Tibshirani, is called the Lasso and uses as its penalty the total weight of the different factors in the model.*
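A sketch of that penalty in action, using scikit-learn's Lasso on invented data: ten candidate factors, only two of which matter, and the weight penalty drives the rest to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(3)

X = rng.normal(size=(200, 10))      # ten candidate factors, two relevant
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

plain = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)  # alpha scales the complexity penalty

print(np.round(plain.coef_, 2))     # small spurious weights on everything
print(np.round(lasso.coef_, 2))     # roughly [2.9, -1.9, 0, 0, ...]: sparse
```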