Kindle Notes & Highlights
Read between June 10 - June 27, 2022
One of the most familiar algorithms in sports is the Round-Robin format, where each of n teams eventually plays every one of the other n − 1 teams. While arguably the most comprehensive, it’s also one of the most laborious. Having every team grapple with every other team is like having guests exchange hugs at our dinner party: the dreaded O(n²), quadratic time.
Perhaps the most prevalent tournament format, however, is a bracket tournament—as in the famous NCAA basketball “March Madness,” among many others. The March Madness tournament progresses from the “Round of 64” and the “Round of 32” to the “Sweet 16,” “Elite Eight,” “Final Four,” and the finals. Each round divides the field in half: does that sound familiarly logarithmic? These tournaments are effectively Mergesort, beginning with unsorted pairs of teams and collating, collating, collating them.
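To make the contrast concrete, here is a small Python sketch (the team counts are arbitrary) tallying how many games each format requires: round-robin grows quadratically, while a single-elimination bracket needs only n − 1 games spread over log₂ n rounds.

```python
import math

def round_robin_games(n):
    """Every team plays every other team once: n*(n-1)/2 games, i.e. O(n^2)."""
    return n * (n - 1) // 2

def bracket_games(n):
    """Single elimination: each game eliminates one team, so n-1 games total,
    played over log2(n) rounds (assuming n is a power of two)."""
    return n - 1, int(math.log2(n))

for n in (8, 64):
    games, rounds = bracket_games(n)
    print(f"{n} teams: round-robin {round_robin_games(n)} games, "
          f"bracket {games} games in {rounds} rounds")
# 8 teams: round-robin 28 games, bracket 7 games in 3 rounds
# 64 teams: round-robin 2016 games, bracket 63 games in 6 rounds
```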
“You go to the money, the money doesn’t come to you.” Vendors go to founders, founders go to venture capitalists, venture capitalists go to their limited partners.
“Look at fish, for example: the bigger one is the dominant one. It’s very simple.” And because it’s so simple, it’s peaceful. Unlike chickens and primates, fish make order without shedding blood.
Your closet presents much the same challenge that a computer faces when managing its memory: space is limited, and the goal is to save both money and time. For as long as there have been computers, computer scientists have grappled with the dual problems of what to keep and how to arrange it.
The idea of keeping around pieces of information that you refer to frequently is so powerful that it is used in every aspect of computation. Processors have caches. Hard drives have caches. Operating systems have caches. Web browsers have caches. And the servers that deliver content to those browsers also have caches.
Modern consumer laptops, tablets, and smartphones have on the order of a six-layer memory hierarchy, and managing memory smartly has never been as important to computer science as it is today.
LRU teaches us that the next thing we can expect to need is the last one we needed, while the thing we’ll need after that is probably the second-most-recent one. And the last thing we can expect to need is the one we’ve already gone longest without.
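As a minimal illustration of that eviction rule, here is a sketch of an LRU cache built on Python's OrderedDict; the capacity and the wardrobe-themed keys are invented examples, not anything from the text.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal Least Recently Used cache: evict the item untouched the longest."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()   # oldest entries sit at the front

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used item

cache = LRUCache(capacity=3)
for k in ["coat", "scarf", "boots", "coat", "umbrella"]:
    cache.put(k, k)
print(list(cache.items))  # ['boots', 'coat', 'umbrella']: 'scarf' was evicted
```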
A quarter of all Internet traffic at present is handled by a single corporation, one that manages to stay almost entirely out of the headlines. This Massachusetts-based company is called Akamai, and they’re in the caching business.
If you can create a cache of webpage content that is physically, geographically closer to the people who want it, you can serve those pages faster.
But this deliberately disorganized-looking storage system still has one visible exception: high-demand items are placed in a different area, more quickly accessible than the rest. That area is Amazon’s cache.
Their patent is actually for shipping items that have been recently popular in a given region to a staging warehouse in that region—like having their own CDN for physical goods. Then, when somebody places an order, the item is just down the street. Anticipating the purchases of individuals is challenging, but when predicting the purchases of a few thousand people, the law of large numbers kicks in.
First, when you are deciding what to keep and what to throw away, LRU is potentially a good principle to use—
Second, exploit geography. Make sure things are in whatever cache is closest to the place where they’re typically used.
What we call “cognitive decline”—lags and retrieval errors—may not be about the search process slowing or deteriorating, but (at least partly) an unavoidable consequence of the amount of information we have to navigate getting bigger and bigger.
Thus we encounter the first lesson in single-machine scheduling literally before we even begin: make your goals explicit. We can’t declare some schedule a winner until we know how we’re keeping score. This is something of a theme in computer science: before you can have a plan, you must first choose a metric.
You might already be using Earliest Due Date to tackle your workload, in which case you probably don’t need computer science to tell you that it’s a sensible strategy. What you may not have known, though, is that it’s the optimal strategy. More precisely, it is optimal assuming that you’re only interested in one metric in particular: reducing your maximum lateness. If that’s not your goal, however, then another strategy might be more applicable.
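A minimal sketch of Earliest Due Date, with invented task names, durations, and deadlines: sort by due date, then walk the schedule and track the worst lateness.

```python
def earliest_due_date(tasks):
    """tasks: list of (name, duration, due) in hypothetical units (e.g. days).
    Returns the schedule and its maximum lateness."""
    schedule = sorted(tasks, key=lambda t: t[2])   # order by due date
    now, max_late = 0, 0
    for name, duration, due in schedule:
        now += duration                            # this task finishes at 'now'
        max_late = max(max_late, now - due)        # lateness (0 if on time)
    return [t[0] for t in schedule], max_late

tasks = [("report", 4, 10), ("email", 1, 2), ("slides", 3, 6)]
print(earliest_due_date(tasks))  # (['email', 'slides', 'report'], 0)
```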
Scheduling theorists call this metric the “sum of completion times.” Minimizing the sum of completion times leads to a very simple optimal algorithm called Shortest Processing Time: always do the quickest task you can.
Only prioritize a task that takes twice as long if it’s twice as important.
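Both Shortest Processing Time and the weighted variant hinted at just above reduce to a single sort. In this sketch the task list is invented; ordering by importance divided by duration collapses to plain SPT whenever every task is equally important.

```python
def weighted_spt(tasks):
    """tasks: list of (name, duration, importance).
    Sort by importance per unit of time, highest first; with equal importance
    this is exactly Shortest Processing Time."""
    return sorted(tasks, key=lambda t: t[2] / t[1], reverse=True)

tasks = [("long but vital", 4, 8), ("quick chore", 1, 1), ("medium", 2, 2)]
print([name for name, *_ in weighted_spt(tasks)])
# ['long but vital', 'quick chore', 'medium']  (densities 2.0, 1.0, 1.0)
```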
Computer science can offer us optimal algorithms for various metrics available in single-machine scheduling, but choosing the metric we want to follow is up to us. In many cases, we get to decide what problem we want to be solving.
Minimizing maximum lateness (for serving customers in a coffee shop) or the sum of completion times (for rapidly shortening your to-do list) both cross the line into intractability if some tasks can’t be started until a particular time. But they return to having efficient solutions once preemption is allowed. In both cases, the classic strategies—Earliest Due Date and Shortest Processing Time, respectively—remain the best, with a fairly straightforward modification.
It turns out, though, that even if you don’t know when tasks will begin, Earliest Due Date and Shortest Processing Time are still optimal strategies, able to guarantee you (on average) the best possible performance in the face of uncertainty.
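The “fairly straightforward modification” for the sum of completion times is commonly described as Shortest Remaining Processing Time: whenever a new task arrives, switch to whichever task has the least time left. Here is a small event-driven sketch of that rule; the arrival times and durations are made up.

```python
import heapq

def srpt_total_completion(jobs):
    """Preemptive Shortest Remaining Processing Time.
    jobs: list of (arrival_time, processing_time); returns sum of completion times."""
    jobs = sorted(jobs)                      # process arrivals in time order
    ready, t, i, total = [], 0, 0, 0         # ready: min-heap of remaining times
    while i < len(jobs) or ready:
        if not ready:
            t = max(t, jobs[i][0])           # idle: jump ahead to the next arrival
        while i < len(jobs) and jobs[i][0] <= t:
            heapq.heappush(ready, jobs[i][1])
            i += 1
        remaining = heapq.heappop(ready)     # the shortest remaining job runs next
        next_arrival = jobs[i][0] if i < len(jobs) else float("inf")
        run = min(remaining, next_arrival - t)
        t += run
        if run < remaining:
            heapq.heappush(ready, remaining - run)  # preempted by a new arrival
        else:
            total += t                              # job completed at time t
    return total

print(srpt_total_completion([(0, 5), (1, 1), (2, 2)]))  # 14
```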
Second, preemption isn’t free. Every time you switch tasks, you pay a price, known in computer science as a context switch. When a computer processor shifts its attention away from a given program, there’s always a certain amount of necessary overhead. It needs to effectively bookmark its place and put aside all of its information related to that program. Then it needs to figure out which program to run next. Finally it must haul out all the relevant information for that program, find its place in the code, and get in gear. None of this switching back and forth is “real work”—that is, none of it actually moves any of the programs forward.
Computers multitask through a process called “threading,” which you can think of as being like juggling a set of balls. Just as a juggler only hurls one ball at a time into the air but keeps three aloft, a CPU only works on one program at a time, but by swapping between them quickly enough (on the scale of ten-thousandths of a second) it appears to be playing a movie, navigating the web, and alerting you to incoming email all at once.
Along with considerations of memory, one of the biggest sources of metawork in switching contexts is the very act of choosing what to do next. This, too, can at times swamp the actual doing of the work.
In a thrashing state, you’re making essentially no progress, so even doing tasks in the wrong order is better than doing nothing at all. Instead of answering the most important emails first—which requires an assessment of the whole picture that may take longer than the work itself—maybe you should sidestep that quadratic-time quicksand by just answering the emails in random order, or in whatever order they happen to appear on-screen.
But what slice size should you aim for? Faced with the question of how long to wait between intervals of performing a recurring task, like checking your email, the answer from the perspective of throughput is simple: as long as possible. But that’s not the end of the story; higher throughput, after all, also means lower responsiveness.
The moral is that you should try to stay on a single task as long as possible without decreasing your responsiveness below the minimum acceptable limit. Decide how responsive you need to be—and then, if you want to get things done, be no more responsive than that.
Likewise, if none of your email correspondents require you to respond in less than twenty-four hours, you can limit yourself to checking your messages once a day.
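A bit of arithmetic makes the trade-off explicit. In the sketch below, the context-switch cost, slice lengths, and task count are invented numbers: longer slices waste a smaller share of time on switching, but any one task waits longer for its next turn.

```python
def slice_tradeoff(slice_len, switch_cost, n_tasks):
    """Fraction of time spent on useful work, and the worst-case wait before a
    given task gets its next turn (assuming fixed round-robin slices)."""
    useful_fraction = slice_len / (slice_len + switch_cost)
    worst_wait = (n_tasks - 1) * (slice_len + switch_cost)
    return useful_fraction, worst_wait

for s in (1, 10, 100):                    # slice lengths, in milliseconds
    useful, wait = slice_tradeoff(s, switch_cost=1, n_tasks=4)
    print(f"slice {s:>3} ms: {useful:.0%} useful work, "
          f"up to {wait} ms before a task runs again")
# slice   1 ms: 50% useful work, up to 6 ms before a task runs again
# slice  10 ms: 91% useful work, up to 33 ms before a task runs again
# slice 100 ms: 99% useful work, up to 303 ms before a task runs again
```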
Bayes’s critical insight was that trying to use the winning and losing tickets we see to figure out the overall ticket pool that they came from is essentially reasoning backward. And to do that, he argued, we need to first reason forward from hypotheticals. In other words, we need to first determine how probable it is that we would have drawn the tickets we did if various scenarios were true. This probability—known to modern statisticians as the “likelihood”—gives us the information we need to solve the problem.
Bayes argued that we should accordingly judge it to be more probable that all the raffle tickets are winners than that half of them are, and in turn more probable that half of them are than that only one in a thousand is. Perhaps we had already intuited as much, but Bayes’s logic offers us the ability to quantify that intuition. All things being equal, we should imagine it to be exactly eight times likelier that all the tickets are winners than that half of them are—because the tickets we drew are exactly eight times likelier (100% versus one-in-eight) in that scenario. Likewise, it’s exactly 125 million times likelier that half of the tickets are winners than that only one in a thousand is (one-in-eight versus one-in-a-billion).
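The ratios above are easy to recompute. This sketch uses the setup implied by the 100% versus one-in-eight figures (three winning tickets drawn) and exact fractions, so the eight-fold and 125-million-fold comparisons fall straight out of the likelihoods.

```python
from fractions import Fraction

# Proportion of winning tickets under each hypothesis
hypotheses = {"all winners": Fraction(1), "half": Fraction(1, 2),
              "one in a thousand": Fraction(1, 1000)}
draws = 3  # three winning tickets drawn (the 100% vs. one-in-eight setup)

# Likelihood: the chance of seeing three winners if each hypothesis were true
likelihood = {name: p ** draws for name, p in hypotheses.items()}

print(likelihood["all winners"] / likelihood["half"])          # 8
print(likelihood["half"] / likelihood["one in a thousand"])    # 125000000
```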
Bayes, as we saw, had found a way to compare the relative probability of one hypothesis to another. But in the case of a raffle, there is literally an infinite number of hypotheses: one for every conceivable proportion of winning tickets.
Laplace’s Law offers us the first simple rule of thumb for confronting small data in the real world. Even when we’ve made only a few observations—or only one—it offers us practical guidance. Want to calculate the chance your bus is late? The chance your softball team will win? Count the number of times it has happened in the past plus one, then divide by the number of opportunities plus two. And the beauty of Laplace’s Law is that it works equally well whether we have a single data point or millions of them.
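Laplace’s Law is a one-liner. In this sketch the bus tallies are invented numbers, purely for illustration.

```python
def laplace(successes, attempts):
    """Laplace's Law: estimated probability = (successes + 1) / (attempts + 2)."""
    return (successes + 1) / (attempts + 2)

print(laplace(0, 0))    # 0.5: no data yet, so call it fifty-fifty
print(laplace(1, 1))    # about 0.667: one try, one success
print(laplace(8, 10))   # 0.75: e.g. the bus was late 8 of the last 10 days
```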
Notably, having some preexisting beliefs is crucial for this formula to work.
This sense of what was “in the bag” before the coin flip—the chances for each hypothesis to have been true before you saw any data—is known as the prior probabilities, or “prior” for short. And Bayes’s Rule always needs some prior from you, even if it’s only a guess.
Unless we know better, we can expect to have shown up precisely halfway into the duration of any given phenomenon.* And if we assume that we’re arriving precisely halfway into something’s duration, the best guess we can make for how long it will last into the future becomes obvious: exactly as long as it’s lasted already.
Without any preconceived expectations, we might use it to obtain predictions for the end of not only the Berlin Wall but any number of other short- and long-lived phenomena.
In the case of a raffle, one way to plead ignorance would be to assume what’s called the “uniform prior,” which considers every proportion of winning tickets to be equally likely.* In the case of the Berlin Wall, an uninformative prior means saying that we don’t know anything about the time span we’re trying to predict: the wall could equally well come down in the next five minutes or last for five millennia.
The Copernican Principle emerges: if we want to predict how long something will last, and have no other knowledge about it whatsoever, the best guess we can make is that it will continue just as long as it’s gone on so far.
This kind of pattern typifies what are called “power-law distributions.” These are also known as “scale-free distributions” because they characterize quantities that can plausibly range over many scales: a town can have tens, hundreds, thousands, tens of thousands, hundreds of thousands, or millions of residents, so we can’t pin down a single value for how big a “normal” town should be.
Good predictions thus begin with having good instincts about when we’re dealing with a normal distribution and when with a power-law distribution. As it turns out, Bayes’s Rule offers us a simple but dramatically different predictive rule of thumb for each.
For any power-law distribution, Bayes’s Rule indicates that the appropriate prediction strategy is a Multiplicative Rule: multiply the quantity observed so far by some constant factor. For an uninformative prior, that constant factor happens to be 2, hence the Copernican prediction.
When we apply Bayes’s Rule with a normal distribution as a prior, on the other hand, we obtain a very different kind of guidance. Instead of a multiplicative rule, we get an Average Rule: use the distribution’s “natural” average—its single, specific scale—as your guide. For instance, if somebody is younger than the average life span, then simply predict the average; as their age gets close to and then exceeds the average, predict that they’ll live a few years more.
The Erlang distribution gives us a third kind of prediction rule, the Additive Rule: always predict that things will go on just a constant amount longer.
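Here are the three rules of thumb side by side as a sketch. The “a few years more” margin for the Average Rule and the constant for the Additive Rule are placeholder values standing in for whatever the relevant prior would actually dictate; with an uninformative prior, the Multiplicative Rule’s factor is 2, which is exactly the Copernican prediction.

```python
def multiplicative_rule(observed_so_far, factor=2):
    """Power-law prior: predict total = observed * factor.
    factor=2 (uninformative prior) is the Copernican prediction."""
    return observed_so_far * factor

def average_rule(observed_so_far, natural_average, a_few_more=5):
    """Normal prior: predict the distribution's natural average; once the
    observation approaches or passes it, predict just a bit more.
    (`a_few_more` is a rough stand-in for the exact Bayesian answer.)"""
    return max(natural_average, observed_so_far + a_few_more)

def additive_rule(observed_so_far, constant=10):
    """Erlang prior: always predict a constant amount more."""
    return observed_so_far + constant

print(multiplicative_rule(6))                   # 12: twice what we've seen so far
print(average_rule(6, natural_average=76))      # 76: just predict the average
print(average_rule(90, natural_average=76))     # 95: a few years more
print(additive_rule(6, constant=10))            # 16: a constant amount longer
```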
Small data is big data in disguise. The reason we can often make good predictions from a small number of observations—or just a single one—is that our priors are so rich. Whether we know it or not, we appear to carry around in our heads surprisingly accurate priors about movie grosses and running times, poem lengths, and political terms of office, not to mention human life spans. We don’t need to gather them explicitly; we absorb them from the world.
Learning self-control is important, but it’s equally important to grow up in an environment where adults are consistently present and trustworthy.
The lesson is this: it is indeed true that including more factors in a model will always, by definition, make it a better fit for the data we have already. But a better fit for the available data does not necessarily mean a better prediction.
Granted, a model that’s too simple—for instance, the straight line of the one-factor formula—can fail to capture the essential pattern in the data. If the truth looks like a curve, no straight line can ever get it right. On the other hand, a model that’s too complicated, such as our nine-factor model here, becomes oversensitive to the particular data points that we happened to observe. As a consequence, precisely because it is tuned so finely to that specific data set, the solutions it produces are highly variable. If the study were repeated with different people, producing slight variations in the data, the one-factor model would change very little, while the nine-factor model could swing dramatically from one version of the study to the next.
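The one-factor versus nine-factor comparison can be reproduced with synthetic data. Everything below (the true straight line, the noise level, the sample sizes) is invented for illustration; the point is that the ninth-degree fit hugs the training points more closely yet typically does worse on fresh data from the same process.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Synthetic data: a straight-line truth (y = 2x + 1) plus Gaussian noise."""
    x = rng.uniform(0, 1, n)
    return x, 2 * x + 1 + rng.normal(0, 0.2, n)

x_train, y_train = sample(12)          # a small study
x_test, y_test = sample(1000)          # "repeating the study": fresh data

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: training error {train_err:.4f}, test error {test_err:.4f}")
# Typically the degree-9 fit reports a lower training error than the straight
# line yet a higher error on the fresh data: a better fit, a worse prediction.
```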
In other words, overfitting poses a danger any time we’re dealing with noise or mismeasurement—and we almost always are. There can be errors in how the data were collected, or in how they were reported.
As a consequence, considering more and more factors and expending more effort to model them can lead us into the error of optimizing for the wrong thing.