The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
14%
As Isaiah Berlin memorably noted, some thinkers are foxes—they know many small things—and some are hedgehogs—they know one big thing. The same is true of learning algorithms.
15%
Routine jobs will be automated and replaced by more interesting ones. Every job will be done better than it is today, whether by a better-trained human, a computer, or a combination of the two.
15%
The algorithms we drive when we use Google, Facebook, or the latest analytics suite are a bit like a black limo with tinted windows that mysteriously shows up at our door one night: Should we get in? Where will it take us? It’s time to get in the driver’s seat. Knowing the assumptions that different learners make will help us pick the right one for the job, instead of going with a random one that fell into our lap—and then suffering with it for years, painfully rediscovering what we should have known from the start. By knowing what learners optimize, we can make certain they optimize what we ...more
15%
Just because computers can learn doesn’t mean they magically acquire a will of their own. Learners learn to achieve the goals we set them; they don’t get to change the goals.
15%
He who controls the data controls the learner.
15%
A theory is a set of constraints on what the world could be, not a complete description of it.
16%
The power of a theory lies in how much it simplifies our description of the world.
17%
Mathematicians like to say that God can disobey the laws of physics, but even he cannot defy the laws of logic. This may be so, but the laws of logic are for deduction; what we need is something equivalent, but for induction.
17%
Which raises the question: how will we know when we’ve found the Master Algorithm? When the same learner, with only parameter changes and minimal input aside from the data, can understand video and text as well as humans, and make significant new discoveries in biology, sociology, and other sciences. Clearly, by this standard no learner has yet been demonstrated to be the Master Algorithm, even in the unlikely case one already exists.
17%
Our search for the Master Algorithm is complicated, but also enlivened, by the rival schools of thought that exist within machine learning. The main ones are the symbolists, connectionists, evolutionaries, Bayesians, and analogizers. Each tribe has a set of core beliefs, and a particular problem that it cares most about. It has found a solution to that problem, based on ideas from its allied fields of science, and it has a master algorithm that embodies
17%
For symbolists, all intelligence can be reduced to manipulating symbols, in the same way that a mathematician solves equations by replacing expressions by other expressions. Symbolists understand that you can’t learn from scratch: you need some initial knowledge to go with the data. They’ve figured out how to incorporate preexisting knowledge into learning, and how to combine different pieces of knowledge on the fly in order to solve new problems. Their master algorithm is inverse deduction, which figures out what knowledge is missing in order to make a deduction go through, and then makes it ...
17%
For connectionists, learning is what the brain does, and so what we need to do is reverse engineer it. The brain learns by adjusting the strengths of connections between neurons, and the crucial problem is figuring out which connections are to blame for which errors and changing them accordingly. The connectionists’ master algorithm is backpropagation, which compares a system’s output with the desired one and then successivel...
17%
Evolutionaries believe that the mother of all learning is natural selection. If it made us, it can make anything, and all we need to do is simulate it on the computer. The key problem that evolutionaries solve is learning structure: not just adjusting parameters, like backpropagation does, but creating the brain that those adjustments can then fine-tune. The evolutionaries’ master algorithm is genetic prog...
17%
Bayesians are concerned above all with uncertainty. All learned knowledge is uncertain, and learning itself is a form of uncertain inference. The problem then becomes how to deal with noisy, incomplete, and even contradictory information without falling apart. The solution is probabilistic inference, and the master algorithm is Bayes’ theorem and its derivatives. Bayes’ theorem tells us how to incorporate...
17%
For analogizers, the key to learning is recognizing similarities between situations and thereby inferring other similarities. If two patients have similar symptoms, perhaps they have the same disease. The key problem is judging how similar two things are. The analogizers’ master algorithm is the support vector machine, which fig...
17%
Each tribe’s solution to its central problem is a brilliant, hard-won advance. But the true Master Algorithm must solv...
18%
Rationalists believe that the senses deceive and that logical reasoning is the only sure path to knowledge. Empiricists believe that all reasoning is fallible and that knowledge must come from observation and experimentation.
18%
How can we ever be justified in generalizing from what we’ve seen to what we haven’t? Every learning algorithm is, in a sense, an attempt to answer this question.
18%
This leads us to the most important problem in machine learning: overfitting, or hallucinating patterns that aren’t really there.
19%
Is there any way to learn something from the past that we can be confident will apply in the future? And if there isn’t, isn’t machine learning a hopeless enterprise? For that matter, isn’t all of science, even all of human knowledge, on rather shaky ground?
19%
Philosophers have debated Hume’s problem of induction ever since he posed it, but no one has come up with a satisfactory answer. Bertrand Russell liked to illustrate the problem with the story of the inductivist turkey.
19%
Two hundred and fifty years after Hume set off his bombshell, it was given elegant mathematical form by David Wolpert, a physicist turned machine learner. His result, known as the “no free lunch” theorem, sets a limit on how good a learner can be. The limit is pretty low: no learner can be better than random guessing!
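The "random guessing" limit can be checked by brute force on a toy problem. The sketch below is an illustration under stated assumptions, not Wolpert's proof: it enumerates every possible Boolean target function over three bits, trains a learner on five of the eight inputs, and averages its accuracy on the three unseen inputs. The function and learner names are hypothetical.

```python
from itertools import product

def average_off_training_accuracy(learner, n_bits=3, n_train=5):
    """Average accuracy on unseen inputs over ALL possible target functions."""
    inputs = list(product([0, 1], repeat=n_bits))
    train_x, test_x = inputs[:n_train], inputs[n_train:]
    correct, total = 0, 0
    # Each target function is one way of labeling all 2**n_bits inputs.
    for labels in product([0, 1], repeat=len(inputs)):
        target = dict(zip(inputs, labels))
        predict = learner([(x, target[x]) for x in train_x])
        correct += sum(predict(x) == target[x] for x in test_x)
        total += len(test_x)
    return correct / total

def majority_learner(train):
    """Predict the most common training label everywhere (ties go to 1)."""
    ones = sum(y for _, y in train)
    guess = 1 if 2 * ones >= len(train) else 0
    return lambda x: guess

def always_zero_learner(train):
    """Ignore the data entirely and always predict 0."""
    return lambda x: 0

print(average_off_training_accuracy(majority_learner))
print(average_off_training_accuracy(always_zero_learner))
```

Both learners should average out to coin-flip accuracy, because for every world in which a prediction is right there is a matching world in which it is wrong. The tie only breaks once some assumption rules out part of those worlds.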
19%
The “no free lunch” theorem is a lot like the reason Pascal’s wager fails. In his Pensées, published in 1669, Pascal said we should believe in the Christian God because if he exists that gains us eternal life, and if he doesn’t we lose very little.
20%
In the meantime, the practical consequence of the “no free lunch” theorem is that there’s no such thing as learning without knowledge. Data alone is not enough. Starting from scratch will only get you to scratch. Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump. Machine learning is what mathematicians call an ill-posed problem: it doesn’t have a unique solution.
20%
Tom Mitchell, a leading symbolist, calls it “the futility of bias-free learning.” In ordinary life, bias is a pejorative word: preconceived notions are bad. But in machine learning, preconceived notions are indispensable; you can’t learn without them. In fact, preconceived notions are also indispensable to human cognition, but they’re hardwired into the brain, and we take them for granted. It’s biases over and beyond those that are questionable.
20%
In the Principia, along with his three laws of motion, Newton enunciates four rules of induction. Although these are much less well known than the physical laws, they are arguably as important. The key rule is the third one, which we can paraphrase thus: Newton’s Principle: Whatever is true of everything we’ve seen is true of everything in the universe.
20%
Newton’s principle is the first unwritten rule of machine learning. We induce the most widely applicable rules we can and reduce their scope only when the data forces us to. At first sight this may seem ridiculously overconfident, but it’s been working for science for over three hundred years. It’s certainly possible to imagine a universe so varied and capricious that Newton’s principle would systematically fail, but that’s not our universe. Newton’s principle is only the first step, however. We still need to figure out what is true of everything we’ve seen—how to extract the regularities from ...more
20%
If all conjunctions of two factors fail, you can try all conjunctions of any number of factors. Machine learners and psychologists call these “conjunctive concepts.” Dictionary definitions are conjunctive concepts: a chair has a seat and a back and some number of legs. Remove any of these and it’s no longer a chair.
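A conjunctive concept can be sketched as a set of attribute conditions that must all hold. The code below uses a common textbook approach (keep the most specific conjunction shared by every positive example, then check it against the negatives); it is an assumption for illustration and not necessarily the procedure the book describes. The attribute names are made up.

```python
def learn_conjunction(examples):
    """Return the most specific conjunction that fits the positives,
    or None if some negative example also satisfies it."""
    positives = [attrs for attrs, label in examples if label]
    negatives = [attrs for attrs, label in examples if not label]
    if not positives:
        return None
    # Start with every attribute=value pair of the first positive example...
    concept = dict(positives[0])
    # ...and drop any condition that another positive example violates.
    for attrs in positives[1:]:
        concept = {k: v for k, v in concept.items() if attrs.get(k) == v}
    # The conjunction fails if a negative example meets all its conditions.
    for attrs in negatives:
        if all(attrs.get(k) == v for k, v in concept.items()):
            return None
    return concept

furniture = [
    ({"seat": True, "back": True, "legs": True, "wheels": False}, True),
    ({"seat": True, "back": True, "legs": True, "wheels": True}, True),
    ({"seat": True, "back": False, "legs": True, "wheels": False}, False),   # stool
    ({"seat": False, "back": False, "legs": True, "wheels": False}, False),  # table
]
chair = learn_conjunction(furniture)
print(chair)
```

On this toy data the learned concept keeps seat, back, and legs and drops wheels, mirroring the dictionary-style definition in the highlight.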
21%
In machine learning, examples of a concept are called positive examples, and counterexamples are called negative examples.
21%
We can learn sets of rules like this one rule at a time, using the algorithm we saw before for learning conjunctive concepts. After we learn each rule, we discard the positive examples that it accounts for, so the next rule tries to account for as many of the remaining positive examples as possible, and so on until all are accounted for. It’s an example of “divide and conquer,” the oldest strategy in the scientist’s playbook. We can also improve the algorithm for finding a single rule by keeping some number n of hypotheses around, not just one, and at each step extending all of them in all ...
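The rule-at-a-time loop described here can be sketched as follows. This is a minimal illustration under assumptions: the single-rule learner is a simple greedy search that keeps only one hypothesis (the n = 1 case, not the n-hypothesis variant the passage mentions), and the scoring heuristic, helper names, and toy data are all hypothetical.

```python
def covers(rule, attrs):
    """A rule is a dict of attribute=value conditions that must all hold."""
    return all(attrs.get(k) == v for k, v in rule.items())

def learn_one_rule(positives, negatives):
    """Greedily add conditions until the rule excludes every negative."""
    rule, pos, neg = {}, list(positives), list(negatives)
    while neg:
        candidates = {(k, v) for a in pos for k, v in a.items() if k not in rule}
        if not candidates:
            return None  # the remaining negatives cannot be excluded

        def score(cond):
            k, v = cond
            p = sum(a.get(k) == v for a in pos)
            n = sum(a.get(k) == v for a in neg)
            return (p / (p + n), p)  # prefer purity, then coverage

        k, v = max(sorted(candidates), key=score)
        rule[k] = v
        pos = [a for a in pos if a.get(k) == v]
        neg = [a for a in neg if a.get(k) == v]
    return rule

def learn_rule_set(examples):
    """Learn rules one at a time, discarding the positives each one covers."""
    positives = [a for a, label in examples if label]
    negatives = [a for a, label in examples if not label]
    rules = []
    while positives:
        rule = learn_one_rule(positives, negatives)
        if rule is None:
            break
        rules.append(rule)
        positives = [a for a in positives if not covers(rule, a)]
    return rules

outings = [
    ({"weather": "sunny", "day": "weekday"}, True),
    ({"weather": "sunny", "day": "weekend"}, True),
    ({"weather": "cloudy", "day": "weekend"}, True),
    ({"weather": "cloudy", "day": "weekday"}, False),
    ({"weather": "rainy", "day": "weekend"}, False),
    ({"weather": "rainy", "day": "weekday"}, False),
]
rules = learn_rule_set(outings)
print(rules)
```

Each learned rule covers at least one remaining positive example, so the outer loop always makes progress; the resulting rule set is a disjunction of conjunctions.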
21%
Discovering rules in this way was the brainchild of Ryszard Michalski, a Polish computer scientist. Michalski’s hometown of Kalusz was successively part of Poland, Russia, Germany, and Ukraine, which may have left him more attuned than most to disjunctive concepts. After immigrating to the United States in 1970, he went on to found the symbolist school of machine learning, along with Tom Mitchell and Jaime Carbonell. He had an imperious personality. If you gave a talk at a machine-learning conference, the odds were good that at the end he’d raise his hand to point out that you had just ...
22%
In his story “Funes the Memorious,” Jorge Luis Borges tells of meeting a youth with perfect memory. This might at first seem like a great fortune, but it is in fact an awful curse.
22%
Whenever a learner finds a pattern in the data that is not actually true in the real world, we say that it has overfit the data. Overfitting is the central problem in machine learning.
22%
Our beliefs are based on our experience, which gives us a very incomplete picture of the world, and it’s easy to jump to false conclusions.
22%
Overfitting happens when you have too many hypotheses and not enough data to tell them apart.
23%
Bottom line: learning is a race between the amount of data you have and the number of hypotheses you consider. More data exponentially reduces the number of hypotheses that survive, but if you start with a lot of them, you may still have some bad ones left at the end. As a rule of thumb, if the learner only considers an exponential number of hypotheses (for example, all possible conjunctive concepts), then the data’s exponential payoff cancels it and you’re OK, provided you have plenty of examples and not too many attributes. On the other hand, if it considers a doubly exponential number (for ...
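To put rough numbers on this race, the sketch below counts hypotheses over n Boolean attributes and plugs them into the standard sample-size bound for a consistent learner over a finite hypothesis space, m ≥ (ln|H| + ln(1/δ)) / ε. That bound is brought in here as an assumption from the general learning-theory literature; it does not appear in the highlighted text, and the default values of ε and δ are arbitrary.

```python
import math

def num_conjunctions(n_attributes):
    # Each Boolean attribute is required true, required false, or ignored.
    return 3 ** n_attributes

def num_boolean_functions(n_attributes):
    # Any labeling of the 2**n possible inputs is a distinct concept.
    return 2 ** (2 ** n_attributes)

def examples_needed(num_hypotheses, epsilon=0.1, delta=0.05):
    """Examples sufficient for error <= epsilon with probability >= 1 - delta."""
    return math.ceil((math.log(num_hypotheses) + math.log(1 / delta)) / epsilon)

for n in (5, 10, 20):
    print(n,
          examples_needed(num_conjunctions(n)),
          examples_needed(num_boolean_functions(n)))
```

For the exponential space the requirement grows linearly with the number of attributes; for the doubly exponential space it grows like 2^n, which quickly outruns any realistic data set.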
23%
Harvard’s Leslie Valiant received the Turing Award, the Nobel Prize of computer science, for inventing this type of analysis, which he describes in his book entitled, appropriately enough, Probably Approximately Correct.
23%
So how do you decide whether to believe what a learner tells you? Simple: you don’t believe anything until you’ve verified it on data that the learner didn’t see. If the patterns the learner hypothesized also hold true on new data, you can be pretty confident that they’re real. Otherwise you know the learner overfit. This is just the scientific method applied to machine learning: it’s not enough for a new theory to explain past evidence because it’s easy to concoct a theory that does that; the theory must also make new predictions, and you only accept it after they’ve been experimentally ...
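A held-out test set makes this check concrete. In the sketch below, which is an illustration with made-up data, the labels are pure coin flips, so there is nothing real to learn. A learner that simply memorizes its training examples looks perfect on the data it saw and is exposed as soon as it is scored on data it did not see. All names are hypothetical.

```python
import random

def holdout_split(examples, n_test, seed=0):
    """Shuffle a copy of the data and set aside n_test examples for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, examples):
    return sum(predict(x) == y for x, y in examples) / len(examples)

def memorizer(train):
    """Recall every training label exactly; guess 0 for anything unseen."""
    table = dict(train)
    return lambda x: table.get(x, 0)

rng = random.Random(1)
examples = [(i, rng.randint(0, 1)) for i in range(200)]  # labels are noise
train, test = holdout_split(examples, n_test=60)
predict = memorizer(train)

train_accuracy = accuracy(predict, train)
test_accuracy = accuracy(predict, test)
print(train_accuracy, test_accuracy)
```

The gap between the two numbers is the signal: perfect recall of the past with roughly coin-flip performance on new data means the "patterns" were hallucinated.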
24%
The preference for simpler hypotheses is popularly known as Occam’s razor, but in a machine-learning context this is somewhat misleading. “Entities should not be multiplied beyond necessity,” as the razor is often paraphrased, just means choosing the simplest theory that fits the data. Occam would probably have been perplexed by the notion that we should prefer a theory that does not perfectly account for the evidence on the grounds that it will generalize better. Simple theories are preferable because they incur a lower cognitive cost (for us) and a lower computational cost (for our ...
24%
If your learner’s test-set accuracy disappoints, you need to diagnose the problem. Was it blindness or hallucination? In machine learning, the technical terms for these are bias and variance. A clock that’s always an hour late has high bias but low variance. If instead the clock alternates erratically between fast and slow but on average tells the right time, it has high variance but low bias.
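The two clocks can be simulated directly. This sketch treats each reading's error in minutes as a fixed offset plus random jitter; the 30-minute jitter for the erratic clock is an assumed figure chosen only for illustration.

```python
import random
import statistics

def clock_errors(offset_minutes, jitter_minutes, n_readings=1000, seed=0):
    """Error of each reading, in minutes, relative to the true time."""
    rng = random.Random(seed)
    return [offset_minutes + rng.gauss(0, jitter_minutes)
            for _ in range(n_readings)]

def bias_and_variance(errors):
    # Bias is the average error; variance is the spread around that average.
    return statistics.fmean(errors), statistics.pvariance(errors)

late_clock = bias_and_variance(clock_errors(offset_minutes=-60, jitter_minutes=0))
erratic_clock = bias_and_variance(clock_errors(offset_minutes=0, jitter_minutes=30))
print(late_clock)
print(erratic_clock)
```

The late clock is wrong in the same way every time, so one correction fixes it; the erratic clock is right on average, but no single reading can be trusted.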
24%
You can estimate the bias and variance of a learner by comparing its predictions after learning on random variations of the training set. If it keeps making the same mistakes, the problem is bias, and you need a more flexible learner (or just a different one). If there’s no pattern to the mistakes, the problem is variance, and you want to either try a less flexible learner or get more data. Most learners have a knob you can turn to make them more or less flexible, such as the threshold for significance tests or the penalty on the size of the model. Tweaking that knob is your first resort.
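The procedure can be sketched on synthetic data where the true answer is known. The assumptions here are loud ones: the target function, the noise level, and the two learners (a rigid one that always predicts the training mean, and a flexible one that copies the nearest training point) are all invented for illustration, and fresh training sets are drawn from a known generator rather than resampled from one fixed data set.

```python
import random
import statistics

def true_f(x):
    return x * x

def make_training_set(rng, n=30, noise=0.1):
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise)) for x in xs]

def mean_learner(train):
    """Rigid: predicts the average training label everywhere."""
    mean_y = statistics.fmean([y for _, y in train])
    return lambda x: mean_y

def nearest_neighbor_learner(train):
    """Flexible: copies the label of the closest training point."""
    return lambda x: min(train, key=lambda point: abs(point[0] - x))[1]

def estimate_bias_variance(learner, n_rounds=200, seed=0):
    rng = random.Random(seed)
    test_xs = [i / 10 for i in range(-9, 10)]
    predictions = {x: [] for x in test_xs}
    # Retrain on many random training sets and record each prediction.
    for _ in range(n_rounds):
        predict = learner(make_training_set(rng))
        for x in test_xs:
            predictions[x].append(predict(x))
    # Squared bias: how far the average prediction is from the truth.
    bias_sq = statistics.fmean(
        [(statistics.fmean(p) - true_f(x)) ** 2 for x, p in predictions.items()])
    # Variance: how much predictions change from one training set to the next.
    variance = statistics.fmean(
        [statistics.pvariance(p) for p in predictions.values()])
    return bias_sq, variance

rigid_bias, rigid_var = estimate_bias_variance(mean_learner)
flexible_bias, flexible_var = estimate_bias_variance(nearest_neighbor_learner)
print(rigid_bias, rigid_var)
print(flexible_bias, flexible_var)
```

The rigid learner repeats the same mistake on every training set (high bias, low variance), while the flexible one tracks the data, noise included (low bias, higher variance).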
24%
The key is to realize that induction is just the inverse of deduction, in the same way that subtraction is the inverse of addition, or integration the inverse of differentiation. This idea was first proposed by William Stanley Jevons in the late 1800s. Steve Muggleton and Wray Buntine, an English-Australian team, designed the first practical algorithm based on it in 1988. The strategy of taking a well-known operation and figuring out its inverse has a storied history in mathematics.
24%
The first statement is a fact about Socrates, and the second is a general rule about humans. What follows? That Socrates is mortal, of course, by applying the rule to Socrates. In inductive reasoning we start instead with the initial and derived facts, and look for a rule that would allow us to infer the latter from the former:
25%
Inverting an operation is often difficult because the inverse is not unique. For example, a positive number has two square roots, one positive and one negative (2² = (-2)² = 4). Most famously, integrating the derivative of a function only recovers the function up to a constant. The derivative of a function tells us how much that function goes up or down at each point. Adding up all those changes gives us the function back, except we don’t know where it started; we can “slide” the integrated function up or down without changing the derivative.
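In symbols, the two examples of a non-unique inverse read as follows, where C stands for an arbitrary constant (a symbol introduced here for illustration):

```latex
\[
x^2 = 4 \;\Longrightarrow\; x = \pm 2,
\qquad
\frac{d}{dx}\bigl(f(x) + C\bigr) = f'(x)
\;\Longrightarrow\;
\int f'(x)\,dx = f(x) + C .
\]
```

Every choice of C gives the same derivative, which is the "sliding up or down" the passage refers to.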
26%
The symbolists’ core belief is that all intelligence can be reduced to manipulating symbols.
26%
Symbolist machine learners share this belief in the power of symbol manipulation with many other computer scientists, psychologists, and philosophers. The psychologist David Marr argued that every information processing system should be studied at three distinct levels: the fundamental properties of the problem it’s solving; the algorithms and representations used to solve it; and how they are physically implemented.
27%
Connectionists, in particular, are highly critical of symbolist learning. According to them, concepts you can define with logical rules are only the tip of the iceberg; there’s a lot going on under the surface that formal reasoning just can’t see, in the same way that most of what goes on in our minds is subconscious.
27%
Hebb’s rule, as it has come to be known, is the cornerstone of connectionism. Indeed, the field derives its name from the belief that knowledge is stored in the connections between neurons. Donald Hebb, a Canadian psychologist, stated it this way in his 1949 book The Organization of Behavior: “When an axon of cell A is near enough cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.” It’s often paraphrased as “Neurons that fire together wire ...
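Hebb's rule is commonly formalized as a weight change proportional to the product of the two neurons' activities, Δw = η · pre · post. The sketch below uses that common formalization as an assumption; the learning rate η and the tiny two-neuron network are invented for illustration and are not taken from Hebb's text.

```python
def hebbian_update(weights, pre, post, learning_rate=0.1):
    """Strengthen weights[i][j] in proportion to the joint activity of
    presynaptic neuron j and postsynaptic neuron i."""
    return [[w + learning_rate * post_i * pre_j for w, pre_j in zip(row, pre)]
            for row, post_i in zip(weights, post)]

weights = [[0.0, 0.0],
           [0.0, 0.0]]
pre_activity = [1.0, 0.0]   # only the first input neuron fires
post_activity = [1.0, 1.0]  # both output neurons fire

for _ in range(3):
    weights = hebbian_update(weights, pre_activity, post_activity)
print(weights)
```

Only the connections from the active input neuron grow; the connections from the silent one stay at zero, since a neuron that never fires alongside another never gets wired to it.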
27%
In his Principles of Psychology, William James enunciates a general principle of association that’s remarkably similar to Hebb’s rule, with neurons replaced by brain processes and firing efficiency by propagation of excitement.
27%
In symbolist learning, there is a one-to-one correspondence between symbols and the concepts they represent. In contrast, connectionist representations are distributed: each concept is represented by many neurons, and each neuron participates in representing many different concepts. Neurons that excite one another form what Hebb called a cell assembly. Concepts and memories are represented in the brain by cell assemblies. Each of these can include neurons from different brain regions and overlap with other assemblies. The cell assembly for “leg” includes the one for “foot,” which includes ...