The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
27%
Overfitting happens when you have too many hypotheses and not enough data to tell them apart. The bad news is that even for the simple conjunctive learner, the number of hypotheses grows exponentially with the number of attributes.
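To see where the exponential comes from: a conjunctive hypothesis can require each attribute to be true, require it to be false, or ignore it, giving three choices per attribute. A quick sketch of the count (the three-choices framing is a standard one, not spelled out in this passage):

```python
# Each Boolean attribute contributes three choices to a conjunction:
# require true, require false, or leave it out entirely.
for n in (10, 20, 30):
    print(f"{n} attributes -> {3 ** n:,} conjunctive hypotheses")
```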
27%
Bottom line: learning is a race between the amount of data you have and the number of hypotheses you consider. More data exponentially reduces the number of hypotheses that survive, but if you start with a lot of them, you may still have some bad ones left at the end.
27%
If the patterns the learner hypothesized also hold true on new data, you can be pretty confident that they’re real. Otherwise you know the learner overfit. This is just the scientific method applied to machine learning: it’s not enough for a new theory to explain past evidence because it’s easy to concoct a theory that does that; the theory must also make new predictions, and you only accept it after they’ve been experimentally verified. (And even then only provisionally, because future evidence could still falsify it.)
28%
But you don’t need to wait around for new data to arrive to decide whether you can trust your learner. Rather, you take the data you have and randomly divide it into a training set, which you give to the learner, and a test set, which you hide from it and use to verify its accuracy. Accuracy on held-out data is the gold standard in machine learning. You can write a paper about a great new learning algorithm you’ve invented, but if your algorithm is not significantly more accurate than previous ones on held-out data, the paper is not publishable.
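A minimal sketch of the train/test protocol using scikit-learn; the dataset and the choice of learner are illustrative assumptions, not from the book:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Randomly split: the learner sees only the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

learner = DecisionTreeClassifier().fit(X_train, y_train)

# Accuracy on data the learner never saw is the real report card.
print("held-out accuracy:", accuracy_score(y_test, learner.predict(X_test)))
```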
28%
If your learner’s test-set accuracy disappoints, you need to diagnose the problem. Was it blindness or hallucination? In machine learning, the technical terms for these are bias and variance. A clock that’s always an hour late has high bias but low variance. If instead the clock alternates erratically between fast and slow but on average tells the right time, it has high variance but low bias.
29%
You can estimate the bias and variance of a learner by comparing its predictions after learning on random variations of the training set. If it keeps making the same mistakes, the problem is bias, and you need a more flexible learner (or just a different one). If there’s no pattern to the mistakes, the problem is variance, and you want to either try a less flexible learner or get more data. Most learners have a knob you can turn to make them more or less flexible, such as the threshold for significance tests or the penalty on the size of the model. Tweaking that knob is your first resort.
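One way to implement that diagnostic is to retrain on bootstrap resamples of the training set and see whether the errors at a test point repeat or scatter. The resampling scheme and the toy data below are my assumptions; the passage only describes the idea:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
x_test = np.array([[1.0]])

for depth in (1, None):  # a very stiff learner vs. a very flexible one
    preds = []
    for _ in range(200):
        idx = rng.integers(0, len(X), len(X))  # random variation of the data
        tree = DecisionTreeRegressor(max_depth=depth).fit(X[idx], y[idx])
        preds.append(tree.predict(x_test)[0])
    preds = np.array(preds)
    # Same mistake every time -> bias; scattered predictions -> variance.
    print(f"depth={depth}: bias={preds.mean() - np.sin(1.0):+.3f}, "
          f"variance={preds.var():.3f}")
```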
31%
A decision tree is like playing a game of twenty questions with an instance. Starting at the root, each node asks about the value of one attribute, and depending on the answer, we follow one or another branch. When we arrive at a leaf, we read off the predicted concept. Each path from the root to a leaf corresponds to a rule. If this reminds you of those annoying phone menus you have to get through when you call customer service, it’s not an accident: a phone menu is a decision tree. The computer on the other end of the line is playing a game of twenty questions with you to figure out what you …
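A toy version of the idea in code, with a phone-menu flavor; the questions and departments are invented for illustration:

```python
def classify(caller):
    """Walk root-to-leaf, asking about one attribute per node."""
    if caller["issue"] == "billing":
        if caller["has_account"]:
            return "billing department"
        return "new accounts"
    else:
        if caller["is_outage"]:
            return "emergency line"
        return "tech support"

print(classify({"issue": "billing", "has_account": True}))   # billing department
print(classify({"issue": "technical", "is_outage": False}))  # tech support
```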
31%
According to the decision tree above, you’re either a Republican, a Democrat, or an independent; you can’t be more than one, or none of the above. Sets of concepts with this property are called sets of classes, and the algorithm that predicts them is a classifier. A single concept implicitly defines two classes: the concept itself and its negation. (For example, spam and nonspam.) Classifiers are the most widespread form of machine learning.
32%
The symbolists’ core belief is that all intelligence can be reduced to manipulating symbols. A mathematician solves equations by moving symbols around and replacing symbols by other symbols according to predefined rules. The same is true of a logician carrying out deductions. According to this hypothesis, intelligence is independent of the substrate; it doesn’t matter if the symbol manipulations are done by writing on a blackboard, switching transistors on and off, firing neurons, or playing with Tinkertoys. If you have a setup with the power of a universal Turing machine, you can do anything. …
32%
Because of its origins and guiding principles, symbolist machine learning is still closer to the rest of AI than the other schools. If computer science were a continent, symbolist learning would share a long border with knowledge engineering.
32%
Symbolism is the shortest path to the Master Algorithm. It doesn’t require us to figure out how evolution or the brain works, and it avoids the mathematical complexities of Bayesianism. Sets of rules and decision trees are easy to understand, so we know what the learner is up to. This makes it easier to figure out what it’s doing right and wrong, fix the latter, and have confidence in the results.
32%
Converting a decision tree to a set of rules is easy: each path from the root to a leaf becomes a rule, and there’s no blowup. On the other hand, in the worst case converting a set of rules into a decision tree requires converting each rule into a mini-decision tree, and then replacing each leaf of rule 1’s tree with a copy of rule 2’s tree, each leaf of each copy of rule 2 with a copy of rule 3, and so on, causing a massive blowup.
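The easy direction is a one-pass walk over the tree. A sketch, using an ad hoc nested-tuple encoding of a tree (my own encoding, for illustration):

```python
# A tree node is (attribute, {value: subtree}); a leaf is just a label string.
tree = ("issue", {
    "billing": ("has_account", {"yes": "billing dept", "no": "new accounts"}),
    "technical": "tech support",
})

def tree_to_rules(node, conditions=()):
    """Each root-to-leaf path becomes one rule; no blowup."""
    if isinstance(node, str):  # leaf: emit one rule
        return [(conditions, node)]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

for conds, label in tree_to_rules(tree):
    print(" AND ".join(f"{a}={v}" for a, v in conds), "->", label)
```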
32%
Connectionists, in particular, are highly critical of symbolist learning.
32%
If you want to reverse engineer a car, you look under the hood. If you want to reverse engineer the brain, you look inside the skull.
33%
If you ask a symbolist system where the concept “New York” is represented, it can point to the precise location in memory where it’s stored. In a connectionist system, the answer is “it’s stored a little bit everywhere.”
33%
Another difference between symbolist and connectionist learning is that the former is sequential, while the latter is parallel.
33%
The brain is a forest of billions of these trees, but there’s something unusual about them. Each tree’s branches make connections—synapses—to the roots of thousands of others, forming a massive tangle like nothing you’ve ever seen. Some neurons have short axons and some have exceedingly long ones, reaching clear from one side of the brain to the other. Placed end to end, the axons in your brain would stretch from Earth to the moon.
33%
And this jungle crackles with electricity. Sparks run along tree trunks and set off more sparks in neighboring trees.
33%
The end result of this phenomenally complex pattern of neuron firings is your consciousness.
34%
We know that synapses do grow (or form anew) when the postsynaptic neuron fires soon after the presynaptic one.
34%
If enough presynaptic neurons fire close together, the voltage suddenly spikes, and an action potential travels down the postsynaptic neuron’s axon. This also causes the ion channels to become more responsive and new channels to appear, strengthening the synapse. To the best of our knowledge, this is how neurons learn. The next step is to turn it into an algorithm.
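A minimal algorithmic caricature of that rule: strengthen whichever synapses were active when the neuron fired. This is a Hebbian-style sketch; the threshold and learning rate are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
n_synapses = 5
weights = rng.uniform(0.0, 0.1, size=n_synapses)  # synapse strengths
lr = 0.05
threshold = 0.3

for _ in range(100):
    pre = rng.random(n_synapses) < 0.5    # which presynaptic neurons fire
    if weights[pre].sum() > threshold:    # enough input -> action potential
        # Strengthen exactly the synapses whose inputs helped cause the spike.
        weights[pre] += lr

print(weights.round(2))
```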
34%
What the McCulloch-Pitts neuron doesn’t do is learn. For that we need to give variable weights to the connections between neurons, resulting in what’s called a perceptron.
34%
In a perceptron, a positive weight represents an excitatory connection, and a negative weight an inhibitory one.
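A minimal perceptron with Rosenblatt's learning rule, trained here on the OR function (the toy task is my choice for illustration):

```python
import numpy as np

# Learn the OR function with the classic perceptron update rule.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])

weights = np.zeros(2)  # positive weights excite, negative ones inhibit
bias = 0.0
lr = 0.1

for _ in range(20):
    for xi, target in zip(X, y):
        output = 1 if weights @ xi + bias > 0 else 0
        error = target - output
        weights += lr * error * xi  # nudge weights toward the correct answer
        bias += lr * error

print(weights, bias)
print([1 if weights @ xi + bias > 0 else 0 for xi in X])  # [0, 1, 1, 1]
```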
35%
The knowledge engineers were irritated by Rosenblatt’s claims and envious of all the attention and funding neural networks, and perceptrons in particular, were getting. One of them was Marvin Minsky, a former classmate of Rosenblatt’s at the Bronx High School of Science and by then the leader of the AI group at MIT. (Ironically, his PhD had been on neural networks, but he had grown disillusioned with them.) In 1969, Minsky and his colleague Seymour Papert published Perceptrons, a book detailing the shortcomings of the eponymous algorithm, with example after example of simple things it couldn’t …
35%
A perceptron models only a single neuron’s learning, however, and although Minsky and Papert acknowledged that layers of interconnected neurons should be capable of more, they didn’t see a way to learn them. Neither did anyone else.
35%
And Prince Charming would be a Caltech physicist by the name of John Hopfield. In 1982, Hopfield noticed a striking analogy between the brain and spin glasses, an exotic material much beloved of statistical physicists. This set off a connectionist renaissance that culminated a few years later in the invention of the first algorithms capable of solving the credit-assignment problem, ushering in a new era where machine learning replaced knowledge engineering as the dominant paradigm in AI.
36%
For one, spin interactions are symmetric, and connections between neurons in the brain are not. Another big issue that Hopfield’s model ignored is that real neurons are statistical: they don’t deterministically turn on and off as a function of their inputs; rather, as the weighted sum of inputs increases, the neuron becomes more likely to fire, but it’s not certain that it will. In 1985, David Ackley, Geoff Hinton, and Terry Sejnowski replaced the deterministic neurons in Hopfield networks with probabilistic ones. A neural network now had a probability distribution over its states, with …
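A sketch of such a probabilistic neuron, assuming the standard logistic form for the firing probability (the weights and inputs are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_neuron(weights, inputs):
    """Fire with a probability that rises smoothly with the input drive."""
    drive = float(np.dot(weights, inputs))
    p_fire = 1.0 / (1.0 + np.exp(-drive))  # more drive -> more likely to fire
    return rng.random() < p_fire

w = np.array([0.8, -0.4, 1.2])
x = np.array([1.0, 1.0, 0.0])  # drive = 0.4, so p(fire) is about 0.6
print([stochastic_neuron(w, x) for _ in range(10)])
```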
36%
Hinton, a psychologist turned computer scientist and great-great-grandson of George Boole, the inventor of the logical calculus used in all digital computers, is the world’s leading connectionist. He has tried longer and harder to understand how the brain works than anyone else.
36%
Rather than a logic gate, a neuron is more like a voltage-to-frequency converter. The curve of frequency as a function of voltage is an S curve.
36%
The transfer curve of a transistor, which relates its input and output voltages, is also an S curve. So both computers and the brain are filled with S curves. But it doesn’t end there. The S curve is the shape of phase transitions of all kinds: the probability of an electron flipping its spin as a function of the applied field, the magnetization of iron, the writing of a bit of memory to a hard disk, an ion channel opening in a cell, ice melting, water evaporating, the inflationary expansion of the early universe, punctuated equilibria in evolution, paradigm shifts in science, the spread of …
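For reference, the canonical S curve is the logistic function; a quick numerical look at how it saturates at both ends:

```python
import numpy as np

def logistic(x):
    """The canonical S curve: flat near 0, a steep middle, flat near 1."""
    return 1.0 / (1.0 + np.exp(-x))

for x in (-6, -2, 0, 2, 6):
    print(x, round(logistic(x), 3))  # 0.002, 0.119, 0.5, 0.881, 0.998
```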
37%
The universe is a vast symphony of phase transitions, from the cosmic to the microscopic, from the mundane to the life changing.
37%
When someone talks about exponential growth, ask yourself: How soon will it turn into an S curve?
37%
Add a succession of staggered upward and downward S curves, and you get something close to a sine wave. In fact, every function can be closely approximated by a sum of S curves: when the function goes up, you add an S curve; when it goes down, you subtract one.
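A quick numerical check of that claim: fix a bank of staggered sigmoids and solve for how much of each to add or subtract by least squares. The basis placement and the target function are my choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(0, 2 * np.pi, 200)
target = np.sin(x)

# A bank of staggered S curves spread across the interval.
centers = np.linspace(0, 2 * np.pi, 12)
basis = np.column_stack([sigmoid(4.0 * (x - c)) for c in centers])

# Least squares picks how much of each S curve to add (positive
# coefficient) or subtract (negative coefficient).
coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)
approx = basis @ coeffs
print("max abs error:", np.abs(approx - target).max())  # should be small
```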
37%
Backprop is an efficient way to do it in a multilayer perceptron: keep tweaking the weights so as to lower the error, and stop when all tweaks fail. With backprop, you don’t have to figure out how to tweak each neuron’s weights from scratch, which would be too slow; you can do it layer by layer, tweaking each neuron based on how you tweaked the neurons it connects to. If you had to throw out your entire machine-learning toolkit in an emergency save for one tool, gradient descent is probably the one you’d want to hold on to.
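A minimal backprop sketch: a one-hidden-layer network trained on XOR, with each layer's tweak computed from the deltas of the layer above. The architecture, learning rate, and iteration count are illustrative choices; an unlucky random seed can occasionally get stuck:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: the task a single perceptron famously cannot learn.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))  # hidden -> output
lr = 0.5

for _ in range(20000):
    # Forward pass.
    hidden = sigmoid(X @ W1 + b1)
    out = sigmoid(hidden @ W2 + b2)
    # Backward pass: each layer's tweak comes from the layer above it.
    d_out = (out - y) * out * (1 - out)
    d_hidden = (d_out @ W2.T) * hidden * (1 - hidden)
    # Gradient descent: nudge every weight downhill on the error surface.
    W2 -= lr * hidden.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_hidden
    b1 -= lr * d_hidden.sum(axis=0, keepdims=True)

print(out.round(2))  # should approach [[0], [1], [1], [0]]
```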
38%
Backprop was invented in 1986 by David Rumelhart, a psychologist at the University of California, San Diego, with the help of Geoff Hinton and Ronald Williams.
39%
In fact, Rumelhart is credited with inventing backprop by the Columbus test: Columbus was not the first person to discover America, but the last.
39%
An autoencoder is a multilayer perceptron whose output is the same as its input. In goes a picture of your grandmother and out comes—the same picture of your grandmother. At first this seems like a silly idea: What use could such a contraption possibly be?
40%
So an autoencoder is not unlike a file compression tool, with two important advantages: it figures out how to compress things on its own, and like Hopfield networks, it can turn a noisy, distorted image into a nice clean one.
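The classic small demonstration of the idea is the 4-2-4 encoder: four one-hot patterns squeezed through two hidden units and reconstructed. A sketch reusing the backprop recipe above; the hyperparameters are arbitrary, and training can occasionally stall from a bad seed:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 4-2-4 autoencoder: the narrow middle forces the network to invent
# a compressed code for its four one-hot inputs.
X = np.eye(4)
W1, b1 = rng.normal(size=(4, 2)), np.zeros(2)  # encoder
W2, b2 = rng.normal(size=(2, 4)), np.zeros(4)  # decoder
lr = 1.0

for _ in range(20000):
    hidden = sigmoid(X @ W1 + b1)
    out = sigmoid(hidden @ W2 + b2)
    d_out = (out - X) * out * (1 - out)  # target: output equals input
    d_hidden = (d_out @ W2.T) * hidden * (1 - hidden)
    W2 -= lr * hidden.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hidden
    b1 -= lr * d_hidden.sum(axis=0)

print(hidden.round(2))  # the learned two-number code for each input
print(out.round(2))     # reconstructions close to the identity matrix
```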
40%
If we allow different bits to represent different inputs, the inputs no longer have to compete to set the same bits. Also, the network now has many more parameters, so the hyperspace you’re in has many more dimensions, and you have many more ways to get out of what would otherwise be local maxima. This is called a sparse autoencoder, and it’s a neat trick. We haven’t seen any deep learning yet, though. The next clever idea is to stack sparse autoencoders on top of each other like a club sandwich. The hidden layer of the first autoencoder becomes the input/output layer of the second one, and so on…
40%
Ng, whose affability belies a fierce ambition, believes that stacked sparse autoencoders can take us closer to solving AI than anything that came before.
40%
The Google network is still pretty shallow; only three of its nine layers are autoencoders. A multilayer perceptron is a passable model of the cerebellum, the part of the brain responsible for low-level motor control, but the cortex is another story. It’s missing the backward connections needed to propagate errors, for one, and yet it’s where the real learning wizardry resides.
40%
The nervous system of the C. elegans worm consists of only 302 neurons and was completely mapped in 1986, but we still have only a fragmentary understanding of what it does.
40%
We don’t build airplanes by reverse engineering feathers, and airplanes don’t flap their wings. Rather, airplane designs are based on the principles of aerodynamics, which all flying objects must obey. We still do not understand those analogous principles of thought.
40%
Neural networks are not compositional, and compositionality is a big part of human cognition. Another big issue is that humans—and symbolic models like sets of rules and decision trees—can explain their reasoning, while neural networks are big piles of numbers that no one can understand.
41%
The algorithm that evolved these robots was invented by Charles Darwin in the nineteenth century. He didn’t think of it as an algorithm at the time, partly because a key subroutine was still missing. Once James Watson and Francis Crick provided it in 1953, the stage was set for the second coming of evolution: in silico instead of in vivo, and a billion times faster.
41%
And in 1959 he earned the world’s first PhD in computer science.
42%
Since the fittest humans in history as measured by number of descendants are the likes of Genghis Khan—ancestor to one in two hundred men alive today—perhaps it’s not so bad that in real life immortality is verboten.
42%
Sets of rules like this, which Holland called classifier systems, are one of the workhorses of the machine-learning tribe he founded: the evolutionaries. Like multilayer perceptrons, classifier systems face the credit-assignment problem—what is the fitness of rules for intermediate concepts?—and Holland devised the so-called bucket brigade algorithm to solve it. Nevertheless, classifier systems are much less widely used than multilayer perceptrons.
45%
Depending on your point of view, either Lang’s paper or Koza’s response was the last straw; regardless, the Tahoe incident marked the final divorce between the evolutionaries and the rest of the machine-learning community, with the evolutionaries moving out of the house. Genetic programmers started their own conference, which merged with the genetic algorithms conference to form GECCO, the Genetic and Evolutionary Computing Conference. For its part, the machine-learning mainstream largely forgot them. A sad dénouement, but not the first time in history that sex is to blame for a breakup.
46%
The Master Algorithm is neither genetic programming nor backprop, but it has to include the key elements of both: structure learning and weight learning.