Kindle Notes & Highlights
Read between September 27 - November 20, 2024
Gambling is characterized by a similar kind of steady-state expectancy. If your wait for, say, a win at the roulette wheel were characterized by a normal distribution, then the Average Rule would apply: after a run of bad luck, it’d tell you that your number should be coming any second, probably followed by more losing spins.
If, instead, the wait for a win obeyed a power-law distribution, then the Multiplicative Rule would tell you that winning spins follow quickly after one another, but the longer a drought had gone on the longer it would probably continue.
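As a sketch (with made-up wait-time distributions, not data from any real casino), a small simulation shows both rules at work: conditioning a normal distribution on a long wait barely raises the expected total, while conditioning a heavy-tailed power-law distribution stretches it multiplicatively.

```python
import random

def expected_total_given_elapsed(sample, elapsed, n=500_000):
    """Estimate E[total wait | wait > elapsed] by rejection sampling."""
    total, count = 0.0, 0
    for _ in range(n):
        x = sample()
        if x > elapsed:
            total += x
            count += 1
    return total / count

random.seed(0)
normal = lambda: random.gauss(100, 15)                     # waits cluster around 100
power_law = lambda: 10 / (1 - random.random()) ** (1 / 3)  # Pareto-style heavy tail

# Average Rule: after waiting 110, only a little more wait is expected.
print(expected_total_given_elapsed(normal, 110))
# Multiplicative Rule: after a 110-unit drought, expect roughly half again as long.
print(expected_total_given_elapsed(power_law, 110))
```

The exact numbers are noisy estimates, but the pattern is stable: the normal case predicts a total just above 110, while the power-law case predicts a total that scales with how long you have already waited.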
Small data is big data in disguise. The reason we can often make good predictions from a small number of observations—or just a single one—is that our priors are so rich. Whether we know it or not, we appear to carry around in our heads surprisingly accurate priors about movie grosses and running times, poem lengths, and political terms of office, not to mention human life spans.
Failing the marshmallow test—and being less successful in later life—may not be about lacking willpower. It could be a result of believing that adults are not dependable: that they can’t be trusted to keep their word, that they disappear for intervals of arbitrary length. Learning self-control is important, but it’s equally important to grow up in an environment where adults are consistently present and trustworthy.
He is careful of what he reads, for that is what he will write. He is careful of what he learns, for that is what he will know. —ANNIE DILLARD
There’s a curious tension, then, between communicating with others and maintaining accurate priors about the world. When people talk about what interests them—and offer stories they think their listeners will find interesting—it skews the statistics of our experience. That makes it hard to maintain appropriate prior distributions. And the challenge has only increased with the development of the printing press, the nightly news, and social media—innovations that allow our species to spread language mechanically.
Simply put, the representation of events in the media does not track their frequency in the world. As sociologist Barry Glassner notes, the murder rate in the United States declined by 20% over the course of the 1990s, yet during that time period the presence of gun violence on American news increased by 600%.
If you want to be a good intuitive Bayesian—if you want to naturally make good predictions, without having to think about what kind of prediction rule is appropriate—you need to protect your priors. Counterintuitively, that might mean turning off the news.
So one of the deepest truths of machine learning is that, in fact, it’s not always better to use a more complex model, one that takes a greater number of factors into account. And the issue is not just that the extra factors might offer diminishing returns—performing better than a simpler model, but not enough to justify the added complexity. Rather, they might make our predictions dramatically worse.
In other words, overfitting poses a danger any time we’re dealing with noise or mismeasurement—and we almost always are.
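To make the danger concrete, here is a self-contained sketch (hypothetical data, pure Python): a straight line and a "perfect" interpolating polynomial are both fit to fifteen noisy points from a linear trend, then scored on fresh points the models never saw.

```python
import random

def fit_line(pts):
    """Ordinary least squares for y = a*x + b."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    a = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
    return a, my - a * mx

def interpolate(pts, x):
    """Lagrange interpolation: the degree n-1 polynomial through all n points."""
    total = 0.0
    for i, (xi, yi) in enumerate(pts):
        term = yi
        for j, (xj, _) in enumerate(pts):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

random.seed(0)
truth = lambda x: 2 * x + 1
train = [(x, truth(x) + random.gauss(0, 1)) for x in range(15)]
fresh = [(x + 0.5, truth(x + 0.5) + random.gauss(0, 1)) for x in range(14)]

a, b = fit_line(train)
mse = lambda f: sum((f(x) - y) ** 2 for x, y in fresh) / len(fresh)
line_err = mse(lambda x: a * x + b)
poly_err = mse(lambda x: interpolate(train, x))
print(line_err)  # near the noise level
print(poly_err)  # far worse: the extra flexibility memorized the noise
```

The polynomial fits the training points exactly, yet its error on fresh data is dramatically larger than the simple line's: the added complexity models the noise, not the trend.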
But being able to modify the foods available to us broke that relationship. We can now add fat and sugar to foods beyond amounts that are good for us, and then eat those foods exclusively rather than the mix of plants, grains, and meats that historically made up the human diet. In other words, we can overfit taste.
Certain visible signs of fitness—low body fat and high muscle mass, for example—are easy to measure, and they are related to, say, minimizing the risk of heart disease and other ailments. But they, too, are an imperfect proxy measure. Overfitting the signals—adopting an extreme diet to lower body fat and taking steroids to build muscle, perhaps—can make you the picture of good health, but only the picture.
There are stories of police officers who find themselves, for instance, taking time out during a gunfight to put their spent casings in their pockets—good etiquette on a firing range. As former Army Ranger and West Point psychology professor Dave Grossman writes, “After the smoke had settled in many real gunfights, officers were shocked to discover empty brass in their pockets with no memory of how it got there. On several occasions, dead cops were found with brass in their hands, dying in the middle of an administrative procedure that had been drilled into them.”
Simply put, Cross-Validation means assessing not only how well a model fits the data it’s given, but how well it generalizes to data it hasn’t seen. Paradoxically, this may involve using less data.
Alongside such tests, however, schools could randomly assess some small fraction of the students—one per class, say, or one in a hundred—using a different evaluation method, perhaps something like an essay or an oral exam.
The standardized tests would provide immediate feedback—you could have students take a short computerized exam every week and chart the class’s progress almost in real time, for instance—while the secondary data points would serve to cross-validate: to make sure that the students were actually acquiring the knowledge that the standardized test is meant to measure, and not simply getting better at test-taking.
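The mechanics are simple enough to sketch. In this hypothetical example (invented score data), k-fold cross-validation compares two prediction rules by how they fare on held-out folds, not on the data they were fit to.

```python
import random

def cross_val_score(data, k, predictor):
    """k-fold cross-validation: fit on k-1 folds, score squared error on the held-out fold."""
    folds = [data[i::k] for i in range(k)]
    total, count = 0.0, 0
    for i in range(k):
        held_out = folds[i]
        rest = [x for j, fold in enumerate(folds) if j != i for x in fold]
        guess = predictor(rest)
        total += sum((x - guess) ** 2 for x in held_out)
        count += len(held_out)
    return total / count

random.seed(0)
scores = [random.gauss(75, 8) for _ in range(40)]  # hypothetical weekly test scores

mean_rule = lambda train: sum(train) / len(train)  # a simple, robust rule
max_rule = lambda train: max(train)                # chases the best score it has seen
print(cross_val_score(scores, 5, mean_rule))  # low held-out error
print(cross_val_score(scores, 5, max_rule))   # much higher held-out error
```

Both rules see the same data; cross-validation exposes that one of them generalizes and the other does not.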
If you can’t explain it simply, you don’t understand it well enough.
If we introduce a complexity penalty, then more complex models need to do not merely a better job but a significantly better job of explaining the data to justify their greater complexity. Computer scientists refer to this principle—using constraints that penalize models for their complexity—as Regularization.
The burden of metabolism, for instance, acts as a brake on the complexity of organisms, introducing a caloric penalty for overly elaborate machinery. The fact that the human brain burns about a fifth of our total daily caloric intake is a testament to the evolutionary advantages that our intellectual abilities provide us with: the brain’s contributions must somehow more than pay for that sizable fuel bill.
Language forms yet another natural Lasso: complexity is punished by the labor of speaking at greater length and the taxing of our listener’s attention span. Business plans get compressed to an elevator pitch; life advice becomes proverbial wisdom only if it is sufficiently concise and catchy. And anything that needs to be remembered has to pass through the inherent Lasso of memory.
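The statistical Lasso behind this metaphor can be sketched in a few lines. For a single-factor model the L1 penalty has a closed form: it shrinks the least-squares weight, and below a threshold it sets the weight to exactly zero (the data here is invented for illustration).

```python
import random

def lasso_weight(xs, ys, lam):
    """One-factor lasso: minimize sum((y - w*x)^2) + lam*|w|.
    Closed form: soft-threshold the least-squares numerator toward zero."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    if num > lam / 2:
        return (num - lam / 2) / den
    if num < -lam / 2:
        return (num + lam / 2) / den
    return 0.0  # below the threshold, the factor is dropped entirely

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(50)]
signal = [3 * x + random.gauss(0, 0.5) for x in xs]  # y genuinely depends on x
noise = [random.gauss(0, 1) for _ in xs]             # y has nothing to do with x

print(lasso_weight(xs, signal, lam=60))  # shrunk, but clearly nonzero
print(lasso_weight(xs, noise, lam=60))   # the spurious factor collapses to zero
```

This is the "less is more" effect in miniature: the penalty buys slightly worse training fit in exchange for discarding factors that carry no real information.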
“In contrast to the widely held view that less processing reduces accuracy,” they write, “the study of heuristics shows that less information, computation, and time can in fact improve accuracy.” A heuristic that favors simpler answers—with fewer factors, or less computation—offers precisely these “less is more” effects.
Having to make use of existing materials, on the other hand, imposes a kind of useful restraint. It makes it harder to induce drastic changes in the structure of organisms, harder to overfit. As a species, being constrained by the past makes us less perfectly adjusted to the present we know but helps keep us robust for the future we don’t.
The effectiveness of regularization in all kinds of machine-learning tasks suggests that we can make better decisions by deliberately thinking and doing less. If the factors we come up with first are likely to be the most important ones, then beyond a certain point thinking more about a problem is not only going to be a waste of time and effort—it will lead us to worse solutions. Early Stopping provides the foundation for a reasoned argument against reasoning, the thinking person’s case against thought.
If you have high uncertainty and limited data, then by all means stop early. If you don’t have a clear read on how your work will be evaluated, and by whom, then it’s not worth the extra time to make it perfect with respect to your own (or anyone else’s) idiosyncratic guess at what perfection might be.
The greater the uncertainty, the bigger the gap between what you can measure and what matters, the more you should watch out for overfitting—that is, the more you should prefer simplicity, and the earlier you should stop.
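One way to sketch Early Stopping as a procedure (the details here are hypothetical): keep refining a model while validation error improves, and quit once it stalls, even though training could continue.

```python
import random

def train_with_early_stopping(step, val_error, max_iters=1000, patience=5):
    """Take training steps until validation error stops improving."""
    best, since_best = float("inf"), 0
    for _ in range(max_iters):
        step()
        err = val_error()
        if err < best - 1e-9:
            best, since_best = err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # further refinement helps only the training data
    return best

random.seed(0)
train = [random.gauss(10, 3) for _ in range(5)]   # small, noisy sample
val = [random.gauss(10, 3) for _ in range(200)]   # held-out data
target = sum(train) / len(train)                  # where full training would converge

state = {"c": 0.0}
def step():
    state["c"] += 0.1 * (target - state["c"])     # inch toward the training-set mean
def val_error():
    return sum((v - state["c"]) ** 2 for v in val) / len(val)

best_err = train_with_early_stopping(step, val_error)
print(best_err)  # near the irreducible noise level of the held-out data
```

The loop halts on its own: polishing the estimate further would only fit the quirks of the five training points.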
They asserted what’s now known as the Cobham–Edmonds thesis: an algorithm should be considered “efficient” if it runs in what’s called “polynomial time”—that is, O(n²), O(n³), or in fact n to the power of any number at all. A problem, in turn, is considered “tractable” if we know how to solve it using an efficient algorithm. A problem we don’t know how to solve in polynomial time, on the other hand, is considered “intractable.”
One of the simplest forms of relaxation in computer science is known as Constraint Relaxation. In this technique, researchers remove some of the problem’s constraints and set about solving the problem they wish they had. Then, after they’ve made a certain amount of headway, they try to add the constraints back in. That is, they make the problem temporarily easier to handle before bringing it back to reality.
And while the minimum spanning tree doesn’t necessarily lead straight to the solution of the real problem, it is quite useful all the same. For one thing, the spanning tree, with its free backtracking, will never be any longer than the real solution, which has to follow all the rules. Therefore, we can use the relaxed problem—the fantasy—as a lower bound on the reality. If we calculate that the spanning tree distance for a particular set of towns is 100 miles, we can be sure the traveling salesman distance will be no less than that. And if we find, say, a 110-mile route, we can be certain it is at most 10% longer than the best possible route.
The idea behind such thought exercises is exactly that of Constraint Relaxation: to make the intractable tractable, to make progress in an idealized world that can be ported back to the real one. If you can’t solve the problem in front of you, solve an easier version of it—and then see if that solution offers you a starting point, or a beacon, in the full-blown problem.
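A small sketch makes the lower-bound argument concrete: for a handful of random "towns" (coordinates invented), the minimum spanning tree, which is the relaxed problem, never exceeds the best traveling-salesman tour, which is the real one.

```python
import itertools, math, random

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def mst_length(pts):
    """Prim's algorithm: greedily grow the tree outward from town 0."""
    in_tree = {0}
    total = 0.0
    while len(in_tree) < len(pts):
        d, nxt = min((dist(pts[i], pts[j]), j)
                     for i in in_tree for j in range(len(pts)) if j not in in_tree)
        total += d
        in_tree.add(nxt)
    return total

def tsp_length(pts):
    """Brute force over all round trips from town 0 (fine for a handful of towns)."""
    n = len(pts)
    best = float("inf")
    for perm in itertools.permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        best = min(best, sum(dist(pts[tour[i]], pts[tour[i + 1]]) for i in range(n)))
    return best

random.seed(4)
towns = [(random.random(), random.random()) for _ in range(7)]
print(mst_length(towns) <= tsp_length(towns))  # True: the relaxation is a guaranteed lower bound
```

The guarantee is not an accident of this instance: any tour minus one edge is itself a spanning tree, so the minimum spanning tree can never be longer than the optimal tour.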
What you’d ideally want to find, then, is the smallest subgroup of your friends that knows all the rest of your social circle—which would let you lick the fewest envelopes and still get everyone to attend. Granted, this might sound like a lot of work just to save a few bucks on stamps, but it’s exactly the kind of problem that political campaign managers and corporate marketers want to solve to spread their message most effectively. And it’s also the problem that epidemiologists study in thinking about, say, the minimum number of people in a population—and which people—to vaccinate to protect the whole population from disease.
It turns out that for the invitations problem, Continuous Relaxation with rounding will give us an easily computed solution that’s not half bad: it’s mathematically guaranteed to get everyone you want to the party while sending out at most twice as many invitations as the best solution obtainable by brute force. Similarly, in the fire truck problem, Continuous Relaxation with probabilities can quickly get us within a comfortable bound of the optimal answer.
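For intuition about the underlying problem (the relaxation itself needs a linear-programming solver, so this sketch simply brute-forces a toy instance with made-up names), here is the smallest group of friends whose contacts cover the whole circle:

```python
import itertools

# A hypothetical toy social circle: everyone each friend can pass the word to.
knows = {
    "Ann": {"Ann", "Bo", "Cy", "Dee"},
    "Bo":  {"Bo", "Ann", "Eve"},
    "Cy":  {"Cy", "Ann", "Fay"},
    "Dee": {"Dee", "Ann"},
    "Eve": {"Eve", "Bo", "Fay"},
    "Fay": {"Fay", "Cy", "Eve"},
}

def smallest_inviting_set(knows):
    """Brute force: the smallest subset of friends whose contacts cover everyone."""
    everyone = set(knows)
    for size in range(1, len(knows) + 1):
        for group in itertools.combinations(sorted(knows), size):
            if set().union(*(knows[p] for p in group)) >= everyone:
                return set(group)

print(smallest_inviting_set(knows))  # two envelopes reach all six friends
```

Brute force works here only because the circle is tiny; the number of subsets explodes exponentially, which is exactly why the relaxed, rounded solution with its factor-of-two guarantee is so valuable at realistic scales.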
The idea behind Lagrangian Relaxation is simple. An optimization problem has two parts: the rules and the scorekeeping. In Lagrangian Relaxation, we take some of the problem’s constraints and bake them into the scoring system instead. That is, we take the impossible and downgrade it to costly.
When an optimization problem’s constraints say “Do it, or else!,” Lagrangian Relaxation replies, “Or else what?” Once we can color outside the lines—even just a little bit, and even at a steep cost—problems become tractable that weren’t tractable before.
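The move can be sketched in a toy optimization (numbers invented): a hard requirement, x must be at least 5, becomes a per-unit shortfall penalty, and the size of that penalty decides whether the rule gets bent.

```python
def lagrangian_best(lam, candidates=range(0, 11), demand=5):
    """Relax the hard constraint 'x >= demand' into a per-unit shortfall penalty of lam."""
    def score(x):
        shortfall = max(0, demand - x)
        return x * x + lam * shortfall  # base cost plus the penalized constraint
    return min(candidates, key=score)

print(lagrangian_best(lam=2))    # → 1: the penalty is cheap, so the rule gets broken
print(lagrangian_best(lam=100))  # → 5: the penalty is steep, so the constraint is honored
```

With a mild penalty, the optimizer happily violates the old constraint; crank the penalty up and the relaxed problem's answer coincides with the constrained one, which is how Lagrangian Relaxation turns "do it, or else" into a price.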
Unless we’re willing to spend eons striving for perfection every time we encounter a hitch, hard problems demand that instead of spinning our tires we imagine easier versions and tackle those first. When applied correctly, this is not just wishful thinking, not fantasy or idle daydreaming. It’s one of our best ways of making progress.
Laplace’s proposal pointed to a profound general truth: when we want to know something about a complex quantity, we can estimate its value by sampling from it. This is exactly the kind of calculation that his work on Bayes’s Rule helps us to perform.
Replacing exhaustive probability calculations with sample simulations became known as the Monte Carlo Method, after the Monte Carlo casino in Monaco, a place equally dependent on the vagaries of chance. The Los Alamos team was able to use it to solve key problems in nuclear physics. Today the Monte Carlo Method is one of the cornerstones of scientific computing.
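The canonical first example is estimating π: hopeless as an exhaustive calculation over every point, but straightforward as sampling. This sketch draws random points in the unit square and counts the fraction that land inside the quarter circle.

```python
import random

def estimate_pi(n, seed=0):
    """Monte Carlo: 4 times the fraction of random points inside the quarter circle."""
    rng = random.Random(seed)
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1 for _ in range(n))
    return 4 * hits / n

print(estimate_pi(1_000_000))  # close to pi, and closer still as n grows
```

No geometry beyond the distance formula is needed; the randomness itself does the integration, with accuracy improving as the square root of the number of samples.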
But perhaps the most surprising realization about the power of randomness is that it can be used in situations where chance seemingly plays no role at all. Even if you want the answer to a question that is strictly yes or no, true or false—no probabilities about it—rolling a few dice may still be part of the solution.
As it happens, it is much easier to multiply primes together than to factor them back out. With big enough primes—say, a thousand digits—the multiplication can be done in a fraction of a second while the factoring could take literally millions of years; this makes for what is known as a “one-way function.”
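The dice-rolling algorithm at issue here is randomized primality testing. Below is a compact sketch of the Miller–Rabin test: each random witness either proves the number composite outright or makes "prime" dramatically more likely, and a couple dozen witnesses leave essentially no room for doubt.

```python
import random

def probably_prime(n, trials=20, seed=0):
    """Miller-Rabin: random witnesses can prove compositeness;
    surviving many witnesses makes primality overwhelmingly likely."""
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False
    d, r = n - 1, 0
    while d % 2 == 0:          # write n-1 as d * 2^r with d odd
        d, r = d // 2, r + 1
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.randrange(2, n - 1)
        x = pow(a, d, n)       # fast modular exponentiation
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False       # a is a witness: n is definitely composite
    return True                # very probably prime
```

A "False" answer is a mathematical certainty; a "True" answer is wrong with probability at most one in four to the twentieth power, which is the sense in which rolling dice settles a yes-or-no question.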
When we need to make sense of, say, national health care reform—a vast apparatus too complex to be readily understood—our political leaders typically offer us two things: cherry-picked personal anecdotes and aggregate summary statistics. The anecdotes, of course, are rich and vivid, but they’re unrepresentative. Almost any piece of legislation, no matter how enlightened or misguided, will leave someone better off and someone worse off, so carefully selected stories don’t offer any perspective on broader patterns.
Aggregate statistics, on the other hand, are the reverse: comprehensive but thin. We might learn, for instance, whether average premiums fell nationwide, but not how that change works out on a more granular level: they might go down for most but, Omelas-style, leave some specific group—undergraduates, or Alaskans, or pregnant women—in dire straits.
A close examination of random samples can be one of the most effective means of making sense of something too complex to be comprehended directly.
When it comes to handling a qualitatively unmanageable problem, something so thorny and complicated that it can’t be digested whole—solitaire or atomic fission, primality testing or public policy—sampling offers one of the simplest, and also the best, ways of cutting through the difficulties.
We can see this approach at work with the charity GiveDirectly, which distributes unconditional cash transfers to people living in extreme poverty in Kenya and Uganda. It has attracted attention for rethinking conventional charity practices on a number of levels: not only in its unusual mission, but…
If you’re willing to tolerate an error rate of just 1% or 2%, storing your findings in a probabilistic data structure like a Bloom filter will save you significant amounts of both time and space. And the usefulness of such filters is not confined to search engines: Bloom filters have shipped with a number of recent web browsers to check URLs against a list of known malicious websites, and they are also an important part of cryptocurrencies like Bitcoin.
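A minimal Bloom filter is nothing more than a bit array and a few hash functions. This sketch (sizes and URLs arbitrary) shows the defining trade: items you added are always found, and the price of the saved space is a small chance of a false "maybe."

```python
import hashlib

class BloomFilter:
    """A bit array plus k hash functions: 'maybe present' with a small
    false-positive rate, but never a false negative."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)  # one byte per bit, for simplicity

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for url in ["evil.example", "malware.example"]:
    bf.add(url)
print(bf.might_contain("evil.example"))  # True: added items are always found
print(bf.might_contain("safe.example"))  # almost certainly False
```

Note that the filter stores no URLs at all, only bits: that is where the space savings come from, and why membership answers can only ever be "definitely not" or "probably yes."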
In the case of vacation planning, local maxima are fortunately less fatal, but they have the same character. Even once we’ve found a solution that can’t be improved by any small tweaks, it’s possible that we are still missing the global maximum. The true best itinerary may require a radical overhaul of the trip: doing entire continents in a different order, for instance, or proceeding westward instead of eastward.
We may need to temporarily worsen our solution if we want to continue searching for improvements. And randomness provides a strategy—actually, several strategies—for doing just that.
This technique, developed by the same Los Alamos team that came up with the Monte Carlo Method, is called the Metropolis Algorithm. The Metropolis Algorithm is like Hill Climbing, trying out different small-scale tweaks on a solution, but with one important difference: at any given point, it will potentially accept bad tweaks as well as good ones.
Again, we try to tweak our proposed solution by jiggling around the positions of different cities. If a randomly generated tweak to our travel route results in an improvement, then we always accept it, and continue tweaking from there. But if the alteration would make things a little worse, there’s still a chance that we go with it anyway (although the worse the alteration is, the smaller the chance).
Whether it’s jitter, random restarts, or being open to occasional worsening, randomness is incredibly useful for avoiding local maxima. Chance is not just a viable way of dealing with tough optimization problems; in many cases, it’s essential.
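The acceptance rule that separates the Metropolis Algorithm from Hill Climbing fits in one line. In this sketch (the landscape is invented), Hill Climbing starts on a lesser peak and stays there; the Metropolis walker, by occasionally accepting downhill moves, escapes and finds the taller peak.

```python
import math, random

def landscape(x):
    # Two peaks: a local maximum at x=10 (height 5), the global one at x=30 (height 9).
    return max(5 - abs(x - 10), 9 - abs(x - 30), 0)

def hill_climb(score, x):
    """Move to a better neighbor until no neighbor improves."""
    while True:
        best_neighbor = max((x - 1, x + 1), key=score)
        if score(best_neighbor) <= score(x):
            return x
        x = best_neighbor

def metropolis_max(score, start, steps=50_000, temp=1.0, seed=2):
    """Like hill climbing, but a downhill move of size delta is
    accepted with probability exp(delta / temp)."""
    rng = random.Random(seed)
    x = best = start
    for _ in range(steps):
        candidate = min(59, max(0, x + rng.choice((-1, 1))))  # stay on a bounded map
        delta = score(candidate) - score(x)
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            x = candidate
        if score(x) > score(best):
            best = x
    return best

print(hill_climb(landscape, 10))            # stuck on the local peak at 10
print(metropolis_max(landscape, start=10))  # escapes and finds the global peak at 30
```

Lowering `temp` toward zero recovers plain hill climbing; cooling it gradually over the run is exactly the "annealing" suggested by the physical analogy in the passage that follows.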
In physics, what we call “temperature” is really velocity—random motion at the molecular scale.