Kindle Notes & Highlights
Before he became a professor of operations research at Carnegie Mellon, Michael Trick was a graduate student, looking for love. “It hit me that the problem has been studied: it is the Secretary Problem! I had a position to fill [and] a series of applicants, and my goal was to pick the best applicant for the position.” So he ran the numbers. He didn’t know how many women he could expect to meet in his lifetime, but there’s a certain flexibility in the 37% Rule: it can be applied to either the number of applicants or the time over which one is searching.
Assuming that his search would run from ages eighteen to forty, the 37% Rule gave age 26.1 years as the point at which to switch from looking to leaping.
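The arithmetic behind that figure is simple enough to check; a minimal sketch, assuming the eighteen-to-forty search window described above:

```python
# The 37% Rule applied to a span of time rather than a number of applicants.
# The exact constant behind the "37%" is 1/e (about 36.8%).
import math

start, end = 18.0, 40.0                      # the assumed search window, in years
switch_age = start + (end - start) / math.e  # look before this age, leap after
print(round(switch_age, 1))                  # -> 26.1
```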
In 1969, Marvin Zelen, a biostatistician who spent most of his career at Harvard, proposed conducting “adaptive” trials. One of the ideas he suggested was a randomized “play the winner” algorithm—a version of Win-Stay, Lose-Shift, in which the chance of using a given treatment is increased by each win and decreased by each loss.
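One common way to realize a randomized play-the-winner rule is with an urn of balls, one color per treatment; the sketch below follows that scheme (the treatment names and the urn update are illustrative, not Zelen's exact 1969 procedure):

```python
# A randomized "play the winner" sketch: each treatment's chance of being
# chosen is proportional to its balls in an urn. A win adds a ball for the
# winning treatment; a loss adds a ball for the alternative.
import random

urn = {"treatment_A": 1, "treatment_B": 1}   # hypothetical treatments, equal start

def choose_treatment():
    names = list(urn)
    return random.choices(names, weights=[urn[n] for n in names])[0]

def record_outcome(treatment, success):
    other = "treatment_B" if treatment == "treatment_A" else "treatment_A"
    urn[treatment if success else other] += 1

chosen = choose_treatment()
record_outcome(chosen, success=True)   # the next patient is now likelier to get it
```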
More generally, our intuitions about rationality are too often informed by exploitation rather than exploration. When we talk about decision-making, we usually focus just on the immediate payoff of a single decision—and if you treat every decision as if it were your last, then indeed only exploitation makes sense. But over a lifetime, you’re going to make a lot of decisions. And it’s actually rational to emphasize exploration—the new rather than the best, the exciting rather than the safe, the random rather than the considered—for many of those choices, particularly earlier in life. What we …
As long as the two stacks were themselves sorted, the procedure of merging them into a single sorted stack was incredibly straightforward and took linear time: simply compare the two top cards to each other, move the smaller of them to the new stack you’re creating, and repeat until finished. The program that John von Neumann wrote in 1945 to demonstrate the power of the stored-program computer took the idea of collating to its beautiful and ultimate conclusion. Sorting two cards is simple: just put the smaller one on top. And given a pair of two-card stacks, both of them sorted, you can …
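A minimal sketch of that linear-time merge step, with Python lists standing in for the two sorted stacks of cards:

```python
# Merge two sorted stacks by repeatedly comparing the two "top cards" and
# moving the smaller one onto the output stack.
def merge(left, right):
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # whichever stack remains is already in order
    merged.extend(right[j:])
    return merged

print(merge([2, 5, 9], [1, 6, 7]))   # -> [1, 2, 5, 6, 7, 9]
```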
If you’re still strategizing about that bookshelf, the Mergesort solution would be to order a pizza and invite over a few friends. Divide the books evenly, and have each person sort their own stack. Then pair people up and have them merge their stacks. Repeat this process until there are just two stacks left, and merge them one last time onto the shelf.
A Mergesort in action. Given a shelf of eight unsorted books, start by putting adjacent books into sorted pairs. Then collate the pairs into ordered sets of four, and finally collate those sets to get a fully sorted shelf.
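A sketch of the full procedure the figure describes, assuming the standard library's heapq.merge for the collating step and an invented shelf of authors as input:

```python
# Mergesort: split the shelf in half, sort each half, then collate the two
# sorted halves (heapq.merge performs exactly the compare-the-tops merge).
from heapq import merge

def mergesort(items):
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return list(merge(mergesort(items[:mid]), mergesort(items[mid:])))

shelf = ["Woolf", "Austen", "Tolstoy", "Borges", "Eco", "Calvino", "Joyce", "Dickens"]
print(mergesort(shelf))   # eight unsorted books -> a fully sorted shelf
```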
Perhaps the most prevalent tournament format, however, is a bracket tournament—as in the famous NCAA basketball “March Madness,” among many others. The March Madness tournament progresses from the “Round of 64” and the “Round of 32” to the “Sweet 16,” “Elite Eight,” “Final Four,” and the finals. Each round divides the field in half: does that sound familiarly logarithmic? These tournaments are effectively Mergesort, beginning with unsorted pairs of teams and collating, collating, collating them.
Thus, while the authoritative programming tome Sorting and Searching boldly declares that “bubble sort has no apparent redeeming features,” the research of Ackley and his collaborators suggests that there may be a place for algorithms like Bubble Sort after all. Its very inefficiency—moving items only one position at a time—makes it fairly robust against noise, far more robust than faster algorithms like Mergesort.
But in fact it isn’t Bubble Sort that emerges as the single best algorithm in the face of a noisy comparator. The winner of that particular honor is an algorithm called Comparison Counting Sort. In this algorithm, each item is compared to all the others, generating a tally of how many items it is bigger than. This number can then be used directly as the item’s rank. Since it compares all pairs, Comparison Counting Sort is a quadratic-time algorithm, like Bubble Sort.
Comparison Counting Sort operates exactly like a Round-Robin tournament. In other words, it strongly resembles a sports team’s regular season—playing every other team in the division and building up a win-loss record by which they are ranked.
The Mergesort postseason is chancy, but the Comparison Counting regular season is not; championship rings aren’t robust, but divisional standings are literally as robust as it gets.
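A minimal sketch of that tally-and-rank procedure (using a handful of made-up numbers, and assuming no two items tie, like a season with no identical records):

```python
# Comparison Counting Sort: compare every item against every other, count how
# many it beats, and use that win count directly as its position in the order.
def comparison_counting_sort(items):
    wins = [sum(1 for other in items if item > other) for item in items]
    ranked = [None] * len(items)
    for item, win_count in zip(items, wins):
        ranked[win_count] = item   # beating k others puts you in slot k
    return ranked

print(comparison_counting_sort([31, 41, 59, 26, 53]))   # -> [26, 31, 41, 53, 59]
```

A single botched comparison changes an item's tally by at most one, which is why a noisy comparator can only nudge it one place out of position.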
The scenario Johnson examined was bookbinding, where each book needs to be printed on one machine and then bound on another. But the most common instance of this two-machine setup is much closer to home: the laundry. When you wash your clothes, they have to pass through the washer and the dryer in sequence, and different loads will take different amounts of time in each. A heavily soiled load might take longer to wash but the usual time to dry; a large load might take longer to dry but the usual time to wash. So, Johnson asked, if you have several loads of laundry to do on the same day, what’s …
Intuitively, Johnson’s algorithm works because regardless of how you sequence the loads, there’s going to be some time at the start when the washer is running but not the dryer, and some time at the end when the dryer is running but not the washer. By having the shortest washing times at the start, and the shortest drying times at the end, you maximize the amount of overlap—when the washer and dryer are running simultaneously.
Thus you can keep the total amount of time spent doing laundry to the absolute minimum. Johnson’s analysis had yielded scheduling’s first optimal algorithm: start with the lightest wash, end with the smallest hamper.
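A minimal sketch of Johnson's rule for the washer-and-dryer case (the load times below are invented):

```python
# Johnson's rule for two machines: repeatedly find the single shortest step
# among the remaining loads; if it's a wash time, put that load as early as
# possible, and if it's a dry time, put it as late as possible.
def johnsons_rule(loads):
    """loads: list of (wash_minutes, dry_minutes); returns an optimal order."""
    remaining = list(loads)
    front, back = [], []
    while remaining:
        load = min(remaining, key=min)
        remaining.remove(load)
        wash, dry = load
        if wash <= dry:
            front.append(load)        # shortest step is a wash: schedule early
        else:
            back.insert(0, load)      # shortest step is a dry: schedule late
    return front + back

print(johnsons_rule([(30, 60), (45, 30), (20, 50)]))
# -> [(20, 50), (30, 60), (45, 30)]
```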
What happens in a priority inversion is that a low-priority task takes possession of a system resource (access to a database, let’s say) to do some work, but is then interrupted partway through that work by a timer, which pauses it and invokes the system scheduler. The scheduler tees up a high-priority task, but it can’t run because the database is occupied. And so the scheduler moves down the priority list, running various unblocked medium-priority tasks instead—rather than the high-priority one (which is blocked), or the low-priority one that’s blocking it (which is stuck in line behind all …
The comedian Mitch Hedberg recounts a time when “I was at a casino, I was minding my own business, and this guy came up and said, ‘You’re gonna have to move, you’re blocking the fire exit.’ As though if there was a fire, I wasn’t gonna run.” The bouncer’s argument was priority inversion; Hedberg’s rebuttal was priority inheritance. Hedberg lounging casually in front of a fleeing mob puts his low-priority loitering ahead of their high-priority running for their lives—but not if he inherits their priority. And an onrushing mob has a way of making one inherit their priority rather quickly. As …
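A toy sketch of the priority-inheritance fix (not a real scheduler; the task names and numeric priorities are invented): when a high-priority task blocks on a resource held by a low-priority one, the holder temporarily runs at the blocked task's priority, so medium-priority work can no longer cut in line.

```python
# Priority inheritance in miniature: the low-priority holder of a resource
# inherits the priority of whoever is blocked waiting on it.
class Task:
    def __init__(self, name, priority):
        self.name = name
        self.base_priority = priority
        self.effective_priority = priority

low = Task("logger", priority=1)       # hypothetical low-priority task
medium = Task("refresh", priority=5)   # hypothetical medium-priority task
high = Task("alarm", priority=10)      # hypothetical high-priority task

def block_on(holder, waiter):
    # The holder inherits the waiting task's priority if it is higher.
    holder.effective_priority = max(holder.effective_priority,
                                    waiter.effective_priority)

block_on(low, high)                    # "alarm" is stuck behind "logger"
print(low.effective_priority > medium.effective_priority)   # -> True
```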
In the broadest sense, there are two types of things in the world: things that tend toward (or cluster around) some kind of “natural” value, and things that don’t.
Normal distributions tend to have a single appropriate scale: a one-digit life span is considered tragic, a three-digit one extraordinary.
There are a number of things in the world that don’t look normally distributed, however—not by a long shot. The average population of a town in the United States, for instance, is 8,226. But if you were to make a graph of the number of towns by population, you wouldn’t see anything remotely like a bell curve. There would be way more towns smaller than 8,226 than larger. At the same time, the larger ones would be way bigger than the average. This kind of pattern typifies what are called “power-law distributions.”
These are also known as “scale-free distributions” because they characterize quantities that can plausibly range over many scales: a town can have tens, hundreds, thousands, tens of thousands, hundreds of thousands, or millions of residents, so we can’t pin down a single value for how big a “normal” town should be.
Examining the Copernican Principle, we saw that when Bayes’s Rule is given an uninformative prior, it always predicts that the total life span of an object will be exactly double its current age. In fact, the uninformative prior, with its wildly varying possible scales—the wall that might last for months or for millennia—is a power-law distribution. And for any power-law distribution, Bayes’s Rule indicates that the appropriate prediction strategy is a Multiplicative Rule: multiply the quantity observed so far by some constant factor. For an uninformative prior, that constant factor happens to …
When we apply Bayes’s Rule with a normal distribution as a prior, on the other hand, we obtain a very different kind of guidance. Instead of a multiplicative rule, we get an Average Rule: use the distribution’s “natural” average—its single, specific scale—as your guide. For instance, if somebody is younger than the average life span, then simply predict the average; as their age gets close to and then exceeds the average, predict that they’ll live a few years more.
Between those two extremes, there’s actually a third category of things in life: those that are neither more nor less likely to end just because they’ve gone on for a while. Sometimes things are simply … invariant. The Danish mathematician Agner Krarup Erlang, who studied such phenomena, formalized the spread of intervals between independent events into the function that now carries his name: the Erlang distribution.
The Erlang distribution gives us a third kind of prediction rule, the Additive Rule: always predict that things will go on just a constant amount longer. The familiar refrain of “Just five more minutes!… [five minutes later] Five more minutes!” that so often characterizes human claims regarding, say, one’s readiness to leave the house or office, or the time until the completion of some task, may seem indicative of some chronic failure to make realistic estimates. Well, in the cases where one’s up against an Erlang distribution, anyway, that refrain happens to be correct.
Indeed, distributions that yield the same prediction, no matter their history or current state, are known to statisticians as “memoryless.”
The three prediction rules—Multiplicative, Average, and Additive—are applicable in a wide range of everyday situations. And in those situations, people in general turn out to be remarkably good at using the right prediction rule.
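A side-by-side sketch of the three rules, answering "given that something has lasted t so far, how long will it last in total?" The factor of 2, the natural average, and the constant increment are illustrative placeholders, and the Average Rule below is a coarse stand-in for the smooth Bayesian answer:

```python
# Three prediction rules for total duration, given an age t observed so far.
# The specific numbers are placeholders, not fitted to any real prior.

def multiplicative_rule(t, factor=2.0):
    # Power-law prior: predict a constant multiple of the current age
    # (a factor of 2 corresponds to the uninformative, Copernican case).
    return factor * t

def average_rule(t, natural_average=76.0, a_few_more=8.0):
    # Normal prior: predict the natural scale; once t approaches or passes
    # it, predict just a little more.
    return max(natural_average, t + a_few_more)

def additive_rule(t, constant=5.0):
    # Erlang (memoryless) prior: always predict a constant amount longer.
    return t + constant

for t in (10, 40, 80):
    print(t, multiplicative_rule(t), average_rule(t), additive_rule(t))
```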
Small data is big data in disguise. The reason we can often make good predictions from a small number of observations—or just a single one—is that our priors are so rich.
Decades after the original marshmallow experiments, Walter Mischel and his colleagues went back and looked at how the participants were faring in life. Astonishingly, they found that children who had waited for two treats grew into young adults who were more successful than the others, even measured by quantitative metrics like their SAT scores. If the marshmallow test is about willpower, this is a powerful testament to the impact that learning self-control can have on one’s life.
If you want to be a good intuitive Bayesian—if you want to naturally make good predictions, without having to think about what kind of prediction rule is appropriate—you need to protect your priors. Counterintuitively, that might mean turning off the news.
Overfitting, for instance, explains the irony of our palates. How can it be that the foods that taste best to us are broadly considered to be bad for our health, when the entire function of taste buds, evolutionarily speaking, is to prevent us from eating things that are bad? The answer is that taste is our body’s proxy metric for health. Fat, sugar, and salt are important nutrients, and for a couple hundred thousand years, being drawn to foods containing them was a reasonable measure for a sustaining diet. But being able to modify the foods available to us broke that relationship.
Beware: when you go to the gym to work off the extra weight from all that sugar, you can also risk overfitting fitness. Certain visible signs of fitness—low body fat and high muscle mass, for example—are easy to measure, and they are related to, say, minimizing the risk of heart disease and other ailments. But they, too, are an imperfect proxy measure.
Perhaps nowhere, however, is overfitting as powerful and troublesome as in the world of business. “Incentive structures work,” as Steve Jobs put it. “So you have to be very careful of what you incent people to do, because various incentive structures create all sorts of consequences that you can’t anticipate.”
But when overfitting creeps in, it can prove disastrous. There are stories of police officers who find themselves, for instance, taking time out during a gunfight to put their spent casings in their pockets—good etiquette on a firing range.
Mistakes like these are known in law enforcement and the military as “training scars,” and they reflect the fact that it’s possible to overfit one’s own preparation.
Russian mathematician Andrey Tikhonov proposed one answer: introduce an additional term to your calculations that penalizes more complex solutions. If we introduce a complexity penalty, then more complex models need to do not merely a better job but a significantly better job of explaining the data to justify their greater complexity. Computer scientists refer to this principle—using constraints that penalize models for their complexity—as Regularization.
So what do these complexity penalties look like? One algorithm, discovered in 1996 by biostatistician Robert Tibshirani, is called the Lasso and uses as its penalty the total weight of the different factors in the model.* By putting this downward pressure on the weights of the factors, the Lasso drives as many of them as possible completely to zero.
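A minimal sketch of that penalty at work, assuming scikit-learn as a dependency and inventing nine noisy candidate factors of which only two actually matter:

```python
# Lasso-style (L1) regularization: penalizing the total weight of the factors
# drives most of the nine coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 9))                    # nine candidate factors
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=50)   # two real ones

plain = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.3).fit(X, y)

print(np.round(plain.coef_, 2))   # every factor picks up some weight
print(np.round(lasso.coef_, 2))   # most weights are pushed all the way to zero
```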
The same kind of process is also believed to play a role at the neural level. In computer science, software models based on the brain, known as “artificial neural networks,” can learn arbitrarily complex functions—they’re even more flexible than our nine-factor model above—but precisely because of this very flexibility they are notoriously vulnerable to overfitting. Actual, biological neural networks sidestep some of this problem because they need to trade off their performance against the costs of maintaining it. Neuroscientists have suggested, for instance, that brains try to minimize the …
For example, the oddly cross-wired arrangement of our nervous system (the left side of our body controlled by the right side of our brain and vice versa) reflects the evolutionary history of vertebrates. This phenomenon, called “decussation,” is theorized to have arisen at a point in evolution when early vertebrates’ bodies twisted 180 degrees with respect to their heads.
As entrepreneurs Jason Fried and David Heinemeier Hansson explain, the further ahead they need to brainstorm, the thicker the pen they use—a clever form of simplification by stroke size.
The upshot of Early Stopping is that sometimes it’s not a matter of choosing between being rational and going with our first instinct. Going with our first instinct can be the rational solution. The more complex, unstable, and uncertain the decision, the more rational an approach that is.
Darwin made up his mind exactly when his notes reached the bottom of the diary sheet. He was regularizing to the page. This is reminiscent of both Early Stopping and the Lasso: anything that doesn’t make the page doesn’t make the decision.
Before leading the country through the American Civil War, before drafting the Emancipation Proclamation or delivering the Gettysburg Address, Abraham Lincoln worked as a “prairie lawyer” in Springfield, Illinois, traveling the Eighth Judicial Circuit twice a year for sixteen years. Being a circuit lawyer meant literally making a circuit—moving through towns in fourteen different counties to try cases, riding hundreds of miles over many weeks. Planning these circuits raised a natural challenge: how to visit all the necessary towns while covering as few miles as possible and without going to …
One of the simplest forms of relaxation in computer science is known as Constraint Relaxation. In this technique, researchers remove some of the problem’s constraints and set about solving the problem they wish they had.
For instance, you can relax the traveling salesman problem by letting the salesman visit the same town more than once, and letting him retrace his steps for free. Finding the shortest route under these looser rules produces what’s called the “minimum spanning tree.”
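A minimal sketch of that relaxation: drop the visit-each-town-exactly-once requirement and build a minimum spanning tree instead (here with a small Prim-style loop; the town coordinates are invented). The tree's total length is a lower bound on any legal tour.

```python
# Constraint Relaxation for the traveling salesman: solving the relaxed
# problem yields a minimum spanning tree, whose length bounds any real tour.
import math

towns = {"Springfield": (0, 0), "Urbana": (3, 4), "Bloomington": (6, 1),
         "Danville": (7, 5), "Decatur": (2, 7)}   # invented coordinates

def dist(a, b):
    (x1, y1), (x2, y2) = towns[a], towns[b]
    return math.hypot(x1 - x2, y1 - y2)

def minimum_spanning_tree(names):
    names = list(names)
    connected = {names[0]}
    edges, total = [], 0.0
    while len(connected) < len(names):
        # Greedily attach the closest not-yet-connected town to the tree.
        u, v = min(((u, v) for u in connected for v in names if v not in connected),
                   key=lambda edge: dist(*edge))
        connected.add(v)
        edges.append((u, v))
        total += dist(u, v)
    return edges, total

edges, lower_bound = minimum_spanning_tree(towns)
print(edges, round(lower_bound, 2))
```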
Occasionally it takes a bit of diplomatic finesse, but a Lagrangian Relaxation—where some impossibilities are downgraded to penalties, the inconceivable to the undesirable—enables progress to be made.
The first, Constraint Relaxation, simply removes some constraints altogether and makes progress on a looser form of the problem before coming back to reality. The second, Continuous Relaxation, turns discrete or binary choices into continua: when deciding between iced tea and lemonade, first imagine a 50–50 “Arnold Palmer” blend and then round it up or down. The third, Lagrangian Relaxation, turns impossibilities into mere penalties, teaching the art of bending the rules (or breaking them and accepting the consequences).
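A toy sketch in the spirit of Lagrangian Relaxation, where breaking a time budget stops being forbidden and starts costing points (the activities, values, hours, and penalty rate are all invented):

```python
# Turning an impossibility into a penalty: over-budget plans are allowed but
# charged for each hour of overrun, instead of being ruled out entirely.
from itertools import combinations

value = {"talk": 8, "demo": 6, "poster": 5}    # payoff of each activity
hours = {"talk": 3, "demo": 2, "poster": 2}    # time each one takes
budget = 4                                      # the "hard" limit on hours
rate = 3                                        # penalty per hour over budget

def relaxed_score(chosen):
    payoff = sum(value[a] for a in chosen)
    overrun = max(0, sum(hours[a] for a in chosen) - budget)
    return payoff - rate * overrun

plans = [set(c) for r in range(len(value) + 1) for c in combinations(value, r)]
best = max(plans, key=relaxed_score)
print(best, relaxed_score(best))   # the best plan may now bend the budget
```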
In 1777, George-Louis Leclerc, Comte de Buffon, published the results of an interesting probabilistic analysis. If we drop a needle onto a lined piece of paper, he asked, how likely is it to cross one of the lines? Buffon’s work showed that if the needle is shorter than the gap between the lines, the answer is 2⁄π times the needle’s length divided by the length of the gap. For Buffon, deriving this formula was enough. But in 1812, Pierre-Simon Laplace, one of the heroes of chapter 6, pointed out that this result has another implication: one could estimate the value of π simply by dropping …
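A minimal sketch of Laplace's suggestion, simulating needle drops (the needle length, line spacing, and number of drops are arbitrary choices):

```python
# Estimate pi from Buffon's needle: a needle of length L dropped on lines
# spaced T apart crosses a line with probability (2/pi) * (L/T).
import math
import random

needle, gap, drops = 1.0, 2.0, 1_000_000
crossings = 0
for _ in range(drops):
    center = random.uniform(0, gap / 2)      # distance from the nearest line
    angle = random.uniform(0, math.pi / 2)   # needle's angle to the lines
    if center <= (needle / 2) * math.sin(angle):
        crossings += 1

# Invert the formula: pi is roughly 2 * L * drops / (T * crossings)
print(2 * needle * drops / (gap * crossings))   # hovers around 3.14
```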
Ulam’s insight—that sampling can succeed where analysis fails—was also crucial to solving some of the difficult nuclear physics problems that arose at Los Alamos. A nuclear reaction is a branching process, where possibilities multiply just as wildly as they do in cards: one particle splits in two, each of which may go on to strike others, causing them to split in turn, and so on. Exactly calculating the chances of some particular outcome of that process, with many, many particles interacting, is hard to the point of impossibility. But simulating it, with each interaction being like turning …
But it lurched into practicality in the twentieth century, becoming pivotal in cryptography and online security. As it happens, it is much easier to multiply primes together than to factor them back out. With big enough primes—say, a thousand digits—the multiplication can be done in a fraction of a second while the factoring could take literally millions of years; this makes for what is known as a “one-way function.” In modern encryption, for instance, secret primes known only to the sender and recipient get multiplied together to create huge composite numbers that can be transmitted publicly …
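A toy illustration of that asymmetry, using two small primes that are easy to verify rather than the thousand-digit kind (the trial-division factorer here is deliberately naive):

```python
# Multiplying two primes is a single machine operation; undoing it by trial
# division already takes visible work even at this toy scale, and becomes
# utterly infeasible for the thousand-digit primes used in real cryptography.
import time

p, q = 999_983, 1_000_003    # two modest primes (real keys are vastly larger)
n = p * q                    # the easy direction: effectively instantaneous

def factor(n):
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return n, 1

start = time.perf_counter()
print(factor(n))                                   # -> (999983, 1000003)
print(f"{time.perf_counter() - start:.2f} seconds of trial division")
```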

