Kindle Notes & Highlights
Read between July 25 and August 19, 2020
Copernican Principle is exactly what results from applying Bayes’s Rule using what is known as an uninformative prior.
one way to plead ignorance would be to assume what’s called the “uniform prior,” which considers every proportion of winning tickets to be equally likely.
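A minimal numerical sketch of the claim above, assuming a scale-free 1/T form as the "uninformative" prior (my choice of formalization, with an arbitrary observed age and grid): Bayes's Rule plus that prior recovers the Copernican "predict about twice the current age" rule.

```python
import numpy as np

# Illustrative sketch (not from the book): Bayes's Rule applied to one observation.
t_observed = 8.0                                   # we find something that is 8 units old
spans = np.linspace(t_observed, 10_000, 200_000)   # candidate total lifespans T >= t_observed

# Likelihood: our arrival moment is uniform over the lifespan, so P(see age t | T) = 1/T.
# Uninformative (scale-free) prior: p(T) proportional to 1/T, giving a posterior ~ 1/T^2.
posterior = 1.0 / spans**2
posterior /= posterior.sum()

# The posterior median is the Copernican prediction: roughly double the observed age.
median_T = spans[np.searchsorted(np.cumsum(posterior), 0.5)]
print(f"observed age {t_observed} -> predicted total span ~ {median_T:.1f}")   # ~16
```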
This kind of pattern typifies what are called “power-law distributions.” These are also known as “scale-free distributions” because they characterize quantities that can plausibly range over many scales:
Bayes’s Rule tells us that when it comes to making predictions based on limited evidence, few things are as important as having good priors—that is, a sense of the distribution from which we expect that evidence to have come.
There are a number of domains in the natural world, too, where events are completely independent from one another and the intervals between them thus fall on an Erlang curve.
distributions that yield the same prediction, no matter their history or current state, are known to statisticians as “memoryless.”
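For reference, the textbook statement of that memoryless property (my addition; it holds exactly for the exponential distribution, the simplest Erlang curve): having already waited for time s tells you nothing about the remaining wait.

```latex
P(T > s + t \mid T > s) = P(T > t) \qquad \text{for all } s, t \ge 0
```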
These three very different patterns of optimal prediction—the Multiplicative, Average, and Additive Rules—all result directly from applying Bayes’s Rule to the power-law, normal, and Erlang distributions, respectively.
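A hedged sketch of the three rules as I read them; the specific constants (the doubling factor, the population average, the fixed add-on) are illustrative placeholders, not values from the book.

```python
def multiplicative_rule(observed_so_far, factor=2.0):
    """Power-law prior: predict a constant multiple of what has been observed so far."""
    return factor * observed_so_far

def average_rule(observed_so_far, distribution_mean=76.0, margin=5.0):
    """Normal prior: predict the distribution's average; once past it, predict a little beyond."""
    return max(distribution_mean, observed_so_far + margin)

def additive_rule(observed_so_far, constant=10.0):
    """Erlang prior: predict things will go on a fixed amount longer, regardless of history."""
    return observed_so_far + constant

# e.g., a film's gross (power-law), a lifespan (normal), a politician's remaining tenure (Erlang)
print(multiplicative_rule(6), average_rule(40), additive_rule(3))
```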
Small data is big data in disguise. The reason we can often make good predictions from a small number of observations—or just a single one—is that our priors are so rich.
The fact that, on the whole, people’s hunches seem to closely match the predictions of Bayes’s Rule also makes it possible to reverse-engineer all kinds of prior distributions, even ones about which it’s harder to get authoritative real-world data.
overfitting poses a danger any time we’re dealing with noise or mismeasurement—and we almost always are.
Cross-Validation means assessing not only how well a model fits the data it’s given, but how well it generalizes to data it hasn’t seen. Paradoxically, this may involve using less data.
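A minimal k-fold sketch of the idea, assuming generic fit and error callables (placeholder names of mine, not an API from the book): each fold is held out in turn, so every score is measured on data the model never saw.

```python
import random

def k_fold_cross_validation(data, fit, error, k=5, seed=0):
    """Estimate generalization by training on k-1 folds and scoring on the held-out fold."""
    data = data[:]                       # don't mutate the caller's list
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = fit(train)               # deliberately fit with less than the full data
        scores.append(error(model, held_out))
    return sum(scores) / k               # average out-of-sample error
```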
Alongside such tests, however, schools could randomly assess some small fraction of the students—one per class, say, or one in a hundred—using a different evaluation method, perhaps something like an essay or an oral exam.
Occam’s razor principle, which suggests that, all things being equal, the simplest possible hypothesis is probably the correct one.
Imposing penalties on the ultimate complexity of a model is not the only way to alleviate overfitting, however. You can also nudge a model toward simplicity by controlling the speed with which you allow it to adapt to incoming data.
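As one concrete form of such a penalty, a hedged sketch of an L1 (Lasso-style) complexity term; this is a standard example of the idea, though the exact objective and numbers below are my illustrative choices.

```python
def penalized_loss(weights, data_loss, penalty_strength=0.1):
    """Objective = fit to the data + a price paid for every unit of model complexity.
    A strong enough penalty pushes unimportant weights all the way to zero."""
    complexity = sum(abs(w) for w in weights)      # L1 penalty: total size of the weights
    return data_loss + penalty_strength * complexity

# Two candidate models that fit the data equally well: the simpler one scores better.
print(penalized_loss([0.9, 0.0, 0.0], data_loss=1.0),
      penalized_loss([0.5, 0.4, 0.3], data_loss=1.0))
```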
As a species, being constrained by the past makes us less perfectly adjusted to the present we know but helps keep us robust for the future we don’t.
In machine learning, the advantages of moving slowly emerge most concretely in a regularization technique known as Early Stopping.
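A minimal Early Stopping loop, assuming some iterative train_step and a held-out validation_error (both placeholders of mine): stop as soon as the held-out error stops improving.

```python
def train_with_early_stopping(model, train_step, validation_error, patience=3, max_steps=1000):
    best_error = float("inf")
    steps_without_improvement = 0
    for _ in range(max_steps):
        train_step(model)                       # one more increment of fitting the training data
        err = validation_error(model)           # measured on data the model hasn't been fit to
        if err < best_error:
            best_error, steps_without_improvement = err, 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break                           # further fitting would likely be overfitting
    return model
```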
Giving yourself more time to decide about something does not necessarily mean that you’ll make a better decision. But it does guarantee that you’ll end up considering more factors, more hypotheticals, more pros and cons, and thus risk overfitting.
The underlying issue, Tom eventually realized, was that he’d been using his own taste and judgment as a kind of proxy metric for his students’. This proxy metric worked reasonably well as an approximation, but it wasn’t worth overfitting—which explained why spending extra hours painstakingly “perfecting” all the slides had been counterproductive.
we can make better decisions by deliberately thinking and doing less.
If you have high uncertainty and limited data, then do stop early by all means.
how to best approach problems whose optimal answers are out of reach. How to relax.
Constraint Relaxation. In this technique, researchers remove some of the problem’s constraints and set about solving the problem they wish they had.
we can use the relaxed problem—the fantasy—as a lower bound on the reality.
If you can’t solve the problem in front of you, solve an easier version of it—and then see if that solution offers you a starting point, or a beacon, in the full-blown problem. Maybe it does.
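One standard illustration of the idea (my example, in a travelling-salesman-style setting): drop the constraint that the route must be a single closed tour, and the relaxed problem becomes a minimum spanning tree, whose cost is a valid lower bound on the best possible tour.

```python
import math

def mst_cost(points):
    """Prim's algorithm on the complete graph of pairwise distances:
    the cost of the relaxed problem (a spanning tree instead of a closed tour)."""
    in_tree = {0}
    total = 0.0
    while len(in_tree) < len(points):
        # cheapest edge connecting the tree to a point not yet in it
        d, nxt = min(
            (math.dist(points[i], points[j]), j)
            for i in in_tree
            for j in range(len(points))
            if j not in in_tree
        )
        total += d
        in_tree.add(nxt)
    return total

stops = [(0, 0), (1, 5), (4, 1), (6, 6), (3, 3)]
print("lower bound on the best possible tour:", mst_cost(stops))
```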
there are cases where randomized algorithms can produce good approximate answers to difficult questions faster than all known deterministic algorithms.
on certain problems, randomized approaches can outperform even the best deterministic ones.
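A sketch of the classic case, a Monte Carlo primality test in the Miller-Rabin style: fast, and wrong only with a probability you can drive astronomically low by adding rounds. The round count is an illustrative choice.

```python
import random

def probably_prime(n, rounds=20):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    # write n - 1 as d * 2^r with d odd
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1) if n > 4 else 2
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False              # a is a witness: n is definitely composite
    return True                        # no witness found: prime with high probability

print(probably_prime(2**61 - 1))       # a known Mersenne prime -> True
```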
A statistic can only tell us part of the story, obscuring any underlying heterogeneity.
Time and space are at the root of the most familiar tradeoffs in computer science, but recent work on randomized algorithms shows that there’s also another variable to consider: certainty.
“What we’re going to do is come up with an answer which saves you in time and space and trades off this third dimension: error probability.”
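A tiny Bloom-filter-style sketch of that trade: a small, tunable chance of a false positive bought in exchange for big savings in space and time. The sizes and hashing scheme below are illustrative choices of mine.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # derive num_hashes bit positions from salted hashes of the item
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False means "definitely not seen"; True means "probably seen" (small chance of error)
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("http://example.com")
print(bf.might_contain("http://example.com"), bf.might_contain("http://not-added.com"))
```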
One way to augment Hill Climbing is with what’s known as “jitter”: if it looks like you’re stuck, make a few random small changes (even if they make things worse), then go back to Hill Climbing and see whether you end up at a higher peak.
A second approach is to completely scramble our solution when we reach a local maximum, and start Hill Climbing anew from this random new starting point. This algorithm is known, appropriately enough, as “Random-Restart Hill Climbing”—or, more colorfully, as “Shotgun Hill Climbing.”
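A minimal sketch of both climbing variants, assuming generic random_solution, neighbors, and score callables (all placeholder names of mine):

```python
def hill_climb(start, neighbors, score):
    """Plain Hill Climbing: keep moving to the best-scoring neighbor until none improves."""
    current = start
    while True:
        candidates = list(neighbors(current))
        if not candidates:
            return current
        best_neighbor = max(candidates, key=score)
        if score(best_neighbor) <= score(current):
            return current                      # local maximum reached
        current = best_neighbor

def shotgun_hill_climb(random_solution, neighbors, score, restarts=20):
    """Random-Restart ("Shotgun") Hill Climbing: climb from many random starting
    points and keep the best local maximum found across all of them."""
    return max(
        (hill_climb(random_solution(), neighbors, score) for _ in range(restarts)),
        key=score,
    )
```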
But there’s also a third approach: instead of turning to full-bore randomness when you’re stuck, use a little bit of randomness every time you make a decision. This technique, developed by the same Los Alamos team that came up with the Monte Carlo Method, is called the Metropolis Algorithm.
If a randomly generated tweak to our travel route results in an improvement, then we always accept it, and continue tweaking from there. But if the alteration would make things a little worse, there’s still a chance that we go with it anyway (although the worse the alteration is, the smaller the chance).
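A compact sketch of that acceptance rule, assuming a route-tweaking setting; tweak, cost, and the temperature value are illustrative placeholders.

```python
import math, random

def metropolis_step(route, tweak, cost, temperature=1.0):
    """One Metropolis-style decision: always keep an improvement; keep a worse tweak
    only sometimes, with a probability that shrinks the worse the tweak is."""
    candidate = tweak(route)
    delta = cost(candidate) - cost(route)
    if delta <= 0 or random.random() < math.exp(-delta / temperature):
        return candidate
    return route
```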
So what would happen, Kirkpatrick wondered, if you treated an optimization problem like an annealing problem—if you “heated it up” and then slowly “cooled it off”?
To this day, simulated annealing remains one of the most promising approaches to optimization problems known to the field.
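And a hedged end-to-end sketch of that schedule: start 'hot' so many bad moves are accepted, cool gradually, and finish as strict Hill Climbing. The starting temperature, cooling rate, and step count are arbitrary illustrative values.

```python
import math, random

def simulated_annealing(initial, tweak, cost, start_temp=10.0, cooling=0.995, steps=10_000):
    """Anneal a solution: Metropolis-style acceptance with a slowly shrinking temperature."""
    current = best = initial
    temperature = start_temp
    for _ in range(steps):
        candidate = tweak(current)
        delta = cost(candidate) - cost(current)
        # tolerance for bad moves shrinks as the temperature falls
        if delta <= 0 or random.random() < math.exp(-delta / max(temperature, 1e-12)):
            current = candidate
        if cost(current) < cost(best):
            best = current
        temperature *= cooling                    # gradually "cool off"
    return best
```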
Circuit switching makes plenty of sense for human interaction, but as early as the 1960s it was clear that this paradigm wasn’t going to work for machine communications.
In circuit-switched networks, a call fails if any one of its links gets disrupted—which means that reliability goes down exponentially as a network grows larger. In packet switching, on the other hand, the proliferation of paths in a growing network becomes a virtue: there are now that many more ways for data to flow, so the reliability of the network increases exponentially with its size.
In most scenarios the consequences of communication lapses are rarely so dire, and the need for certainty rarely so absolute.
Three such redundant ACKs in a row would signal to your machine that packet 101 isn’t just delayed but hopelessly gone, so it will resend that packet.
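A toy sketch of that duplicate-ACK signal (my own simplification, using cumulative "last packet received in order" acknowledgments; the packet numbers are illustrative):

```python
def acks_for(received_packets, expecting):
    """Toy receiver: after each arrival, acknowledge the last packet received in order.
    A gap in the sequence makes the same ACK repeat again and again."""
    next_needed = expecting
    for packet in received_packets:
        if packet == next_needed:
            next_needed += 1
        yield next_needed - 1

print(list(acks_for([100, 102, 103, 104], expecting=100)))
# -> [100, 100, 100, 100]: three duplicate ACKs for 100, so the sender concludes
#    packet 101 is gone and retransmits it.
```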
All those acknowledgments can actually add up to a considerable amount of traffic.
So how exactly should we handle a person—or a computer—that’s unreliable? The first question is how long a period of nonresponsiveness we should take to constitute a breakdown.
Exponential Backoff: The Algorithm of Forgiveness
In human society, we tend to adopt a policy of giving people some finite number of chances in a row, then giving up entirely. Three strikes, you’re out. This pattern prevails by default in almost any situation that requires forgiveness, lenience, or perseverance. Simply put, maybe we’re doing it wrong.
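A hedged sketch of the alternative the chapter names, Exponential Backoff: rather than a fixed number of strikes, keep trying, but double the maximum wait after every consecutive failure. The attempt callable and the delay values are placeholders.

```python
import random, time

def retry_with_exponential_backoff(attempt, base_delay=1.0, max_delay=60.0, max_tries=None):
    """Retry indefinitely (or up to max_tries), waiting longer after each failure."""
    tries = 0
    while True:
        if attempt():
            return True                                  # the other side finally responded
        tries += 1
        if max_tries is not None and tries >= max_tries:
            return False
        # wait a random amount up to a ceiling that doubles with every consecutive failure
        ceiling = min(max_delay, base_delay * (2 ** tries))
        time.sleep(random.uniform(0, ceiling))
```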
Tail Drop: an unceremonious way of saying that once the buffer is full, every packet arriving after that point is simply rejected, and effectively deleted.
“successful investing is anticipating the anticipations of others.”
In this way, the value of a stock isn’t what people think it’s worth but what people think people think it’s worth.
In poker, recursion is a dangerous game. You don’t want to get caught one step behind your opponent, of course—but there’s also an imperative not to get too far ahead of them either. “There’s a rule that you really only want to play one level above your opponent.”
equilibrium: that is, a set of strategies that both players can follow such that neither player would want to change their own play, given the play of their opponent. It’s called an equilibrium because it’s stable—no amount of further reflection by either player will bring them to different choices.
More generally, the Nash equilibrium offers a prediction of the stable long-term outcome of any set of rules or incentives. As such, it provides an invaluable tool for both predicting and shaping economic policy, as well as social policy in general.
In fact, this makes defection not merely the equilibrium strategy but what’s known as a dominant strategy. A dominant strategy avoids recursion altogether, by being the best response to all of your opponent’s possible strategies—so you don’t even need to trouble yourself getting inside their head at all. A dominant strategy is a powerful thing.
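To make the "best response to everything" property concrete, a small payoff check in the classic prisoner's-dilemma form; the specific payoff numbers are my illustrative choices, not from the book.

```python
payoff = {  # (my move, their move) -> my payoff (years of freedom, say)
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"):    0,
    ("defect",    "cooperate"): 5,
    ("defect",    "defect"):    1,
}

for their_move in ("cooperate", "defect"):
    best = max(("cooperate", "defect"), key=lambda my_move: payoff[(my_move, their_move)])
    print(f"If they {their_move}, my best response is to {best}")
# Defecting wins in both cases, so no recursion about the opponent's reasoning is needed.
```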