Doug Lautzenheiser’s Kindle Notes & Highlights for The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World

Learning a perceptron’s weights means varying the direction of the straight line until all the positive examples are on one side and all the negative ones on the other. In one dimension, the boundary is a point; in two, it’s a straight line; in three, it’s a plane; and in more than three, it’s a hyperplane. It’s hard to visualize things in hyperspace, but the math works just the same way. In n dimensions, we have n inputs and the perceptron has n weights. To decide whether the perceptron fires or not, we multiply each weight by the corresponding input and compare the sum of all of them with ...more

29%

But then the perceptron hit a brick wall. The knowledge engineers were irritated by Rosenblatt’s claims and envious of all the attention and funding neural networks, and perceptrons in particular, were getting. One of them was Marvin Minsky, a former classmate of Rosenblatt’s at the Bronx High School of Science and by then the leader of the AI group at MIT. (Ironically, his PhD had been on neural networks, but he had grown disillusioned with them.) In 1969, Minsky and his colleague Seymour Papert published Perceptrons, a book detailing the shortcomings of the eponymous algorithm, with example ...more

29%

Since perceptrons can only learn linear boundaries, they can’t learn XOR. And if they can’t do even that, they’re not a very good model of how the brain learns, or a viable candidate for the Master Algorithm.

29%

The problem is that there’s no clear way to change the weights of the neurons in the “hidden” layers to reduce the errors made by the ones in the output layer. Every hidden neuron influences the output via multiple paths, and every error has a thousand fathers. Who do you blame? Or, conversely, who gets the credit for correct outputs? This credit-assignment problem shows up whenever we try to learn a complex model and is one of the central problems in machine learning.

29%

If the history of machine learning were a Hollywood movie, the villain would be Marvin Minsky. He’s the evil queen who gives Snow White a poisoned apple, leaving her in suspended animation. (In a 1988 essay, Seymour Papert even compared himself, tongue-in-cheek, to the huntsman the queen sent to kill Snow White in the forest.) And Prince Charming would be a Caltech physicist by the name of John Hopfield. In 1982, Hopfield noticed a striking analogy between the brain and spin glasses, an exotic material much beloved of statistical physicists. This set off a connectionist renaissance that ...more

30%

In 1985, David Ackley, Geoff Hinton, and Terry Sejnowski replaced the deterministic neurons in Hopfield networks with probabilistic ones. A neural network now had a probability distribution over its states, with higher-energy states being exponentially less likely than lower-energy ones. In fact, the probability of finding the network in a particular state was given by the well-known Boltzmann distribution from thermodynamics, so they called their network a Boltzmann machine.

30%

During sleep, the machine dreams, leaving both sensory and hidden neurons free to wander. Just before the new day dawns, it compares the statistics of its states during the dream and during yesterday’s activities and changes the connection weights so that they match. If two neurons tend to fire together during the day but less so while asleep, the weight of their connection goes up; if it’s the opposite, they go down. By doing this day after day, the predicted correlations between sensory neurons evolve until they match the real ones. At this point, the Boltzmann machine has learned a good ...more

30%

Hinton’s latest passion is deep learning, which we’ll meet later in this chapter. He was also involved in the development of backpropagation, an even better algorithm than Boltzmann machines for solving the credit-assignment problem that we’ll look at next. Boltzmann machines could solve the credit-assignment problem in principle, but in practice learning was very slow and painful, making this approach impractical for most applications. The next breakthrough involved getting rid of another oversimplification that dated all the way back to McCulloch and Pitts.

30%

This curve, which looks like an elongated S, is variously known as the logistic, sigmoid, or S curve. Peruse it closely, because it’s the most important curve in the world. At first the output increases slowly with the input, so slowly it seems constant. Then it starts to change faster, then very fast, then slower and slower until it becomes almost constant again.

30%

In Hemingway’s The Sun Also Rises, when Mike Campbell is asked how he went bankrupt, he replies, “Two ways. Gradually and then suddenly.” The same could be said of Lehman Brothers. That’s the essence of an S curve. One of the futurist Paul Saffo’s rules of forecasting is: look for the S curves.

30%

The S curve is not just important as a model in its own right; it’s also the jack-of-all-trades of mathematics. If you zoom in on its midsection, it approximates a straight line. Many phenomena we think of as linear are in fact S curves, because nothing can grow without limit.

30%

When someone talks about exponential growth, ask yourself: How soon will it turn into an S curve? When will the population bomb peter out, Moore’s law lose steam, or the singularity fail to happen? Differentiate an S curve and you get a bell curve: slow, fast, slow becomes low, high, low. Add a succession of staggered upward and downward S curves, and you get something close to a sine wave. In fact, every function can be closely approximated by a sum of S curves: when the function goes up, you add an S curve; when it goes down, you subtract one.

31%

Most importantly for us, S curves lead to a new solution to the credit-assignment problem. If the universe is a symphony of phase transitions, let’s model it with one. That’s what the brain does: it tunes the system of phase transitions inside to the one outside. So let’s replace the perceptron’s step function with an S curve and see what happens.

31%

A single neuron could only learn straight lines. Given enough hidden neurons, a multilayer perceptron, as it’s called, can represent arbitrarily convoluted frontiers. This makes backpropagation—or simply backprop—the connectionists’ master algorithm. Backprop is an instance of a strategy that is very common in both nature and technology: if you’re in a hurry to get to the top of the mountain, climb the steepest slope you can find. The technical term for this is gradient ascent (if you want to get to the top) or gradient descent (if you’re looking for the valley bottom). Bacteria can find food ...more

31%

Imagine you’ve been kidnapped and left blindfolded somewhere in the Himalayas. Your head is throbbing, and your memory is not too good, either. All you know is you need to get to the top of Mount Everest. What do you do? You take a step forward and nearly slide into a ravine. After catching your breath, you decide to be a bit more systematic. You carefully feel around with your foot until you find the highest point you can and step gingerly to that point. Then you do the same again. Little by little, you get higher and higher. After a while, every step you can take is down, and you stop. ...more

32%

We could do away with the problem of local optima by taking out the S curves and just letting each neuron output the weighted sum of its inputs. That would make the error surface very smooth, leaving only one minimum—the global one. The problem, though, is that a linear function of linear functions is still just a linear function, so a network of linear neurons is no better than a single neuron. A linear brain, no matter how large, is dumber than a roundworm. S curves are a nice halfway house between the dumbness of linear functions and the hardness of step functions.

32%

Backprop was invented in 1986 by David Rumelhart, a psychologist at the University of California, San Diego, with the help of Geoff Hinton and Ronald Williams. Among other things, they showed that backprop can learn XOR, enabling connectionists to thumb their noses at Minsky and Papert.

32%

In an early demonstration of the power of backprop, Terry Sejnowski and Charles Rosenberg trained a multilayer perceptron to read aloud. Their NETtalk system scanned the text, selected the correct phonemes according to context, and fed them to a speech synthesizer. NETtalk not only generalized accurately to new words, which knowledge-based systems could not, but it learned to speak in a remarkably human-like way. Sejnowski used to mesmerize audiences at research meetings by playing a tape of NETtalk’s progress: babbling at first, then starting to make sense, then speaking smoothly with only ...more

32%

Scientists everywhere use linear regression because that’s what they know, but more often than not the phenomena they study are nonlinear, and a multilayer perceptron can model them. Linear models are blind to phase transitions; neural networks soak them up like a sponge.

32%

It turns out that, as is often the case in science, backprop was invented more than once. Yann LeCun in France and others hit on it at around the same time as Rumelhart. A paper on backprop was rejected by the leading AI conference in the early 1980s because, according to the reviewers, Minsky and Papert had already proved that perceptrons don’t work. In fact, Rumelhart is credited with inventing backprop by the Columbus test: Columbus was not the first person to discover America, but the last. It turns out that Paul Werbos, a graduate student at Harvard, had proposed a similar algorithm in ...more

33%

In truth, connectionists have made genuine progress. One of the protagonists of this latest twist in the connectionist roller coaster is an unassuming little device called an autoencoder. An autoencoder is a multilayer perceptron whose output is the same as its input. In goes a picture of your grandmother and out comes—the same picture of your grandmother. At first this seems like a silly idea: What use could such a contraption possibly be? The key is to make the hidden layer much smaller than the input and output layers, so the network can’t just learn to copy the input to the hidden layer ...more

33%

The Google Brain network of New York Times fame is a nine-layer sandwich of autoencoders and other ingredients that learns to recognize cats from YouTube videos. At one billion connections, it was at the time the largest network ever learned. It’s no surprise that Andrew Ng, one of the project’s principals, is also one of the leading proponents of the idea that human intelligence boils down to a single algorithm, and all we need to do is figure it out. Ng, whose affability belies a fierce ambition, believes that stacked sparse autoencoders can take us closer to solving AI than anything that ...more

33%

In his book On Intelligence, Jeff Hawkins advocated designing algorithms closely based on the organization of the cortex, but so far none of these algorithms can compete with today’s deep networks.

34%

The algorithm that evolved these robots was invented by Charles Darwin in the nineteenth century. He didn’t think of it as an algorithm at the time, partly because a key subroutine was still missing. Once James Watson and Francis Crick provided it in 1953, the stage was set for the second coming of evolution: in silico instead of in vivo, and a billion times faster. Its prophet was a ruddy-faced, perpetually grinning midwesterner by the name of John Holland.

34%

Like many other early machine-learning researchers, Holland started out working on neural networks, but his interests took a different turn when, while a graduate student at the University of Michigan, he read Ronald Fisher’s classic treatise The Genetical Theory of Natural Selection. In it, Fisher, who was also the founder of modern statistics, formulated the first mathematical theory of evolution.

34%

The key input to a genetic algorithm, as Holland’s creation came to be known, is a fitness function. Given a candidate program and some purpose it is meant to fill, the fitness function assigns the program a numeric score reflecting how well it fits the purpose.

34%

Instead of 0 and 1, the DNA alphabet has four characters—the four bases adenine, thymine, cytosine, and guanine—but that’s a superficial difference. Variations, whether in DNA sequences or bit strings, can be generated in several ways. The simplest approach is point mutation, flipping a random bit in the string or changing a single base in a stretch of DNA. But for Holland, the real power of genetic algorithms lay in something more complicated: sex.

35%

We can get even fancier by allowing rules for intermediate concepts to evolve, and then chaining these rules at performance time. For example, we could evolve the rules If the e-mail contains the word loan then it’s a scam and If the e-mail is a scam then it’s spam. Since a rule’s consequent is no longer always spam, this requires introducing additional bits in rule strings to represent their consequents. Of course, the computer doesn’t literally use the word scam; it just comes up with some arbitrary bit string to represent the concept, but that’s good enough for our purposes. Sets of rules ...more

36%

What’s clever about genetic algorithms is that each string implicitly contains an exponential number of building blocks, known as schemas, and so the search is a lot more efficient than it seems. This is because every subset of the string’s bits is a schema, representing some potentially fit combination of properties, and a string has an exponential number of subsets.

36%

One of the most important problems in machine learning—and life—is the exploration-exploitation dilemma. If you’ve found something that works, should you just keep doing it? Or is it better to try new things, knowing it could be a waste of time but also might lead to a better solution?

36%

Each time you play, you have to choose between repeating the best move you’ve found so far, which gives you the best payoff, or trying other moves, which gather information that may lead to even better payoffs.

36%

One of Holland’s more remarkable students was John Koza. In 1987, while flying back to California from a conference in Italy, he had a lightbulb moment. Instead of evolving comparatively simple things like If… then… rules and gas pipeline controllers, why not evolve full-blown computer programs? And if that’s the goal, why stick with bit strings as the representation? A program is really a tree of subroutine calls, so better to directly cross over those subtrees than to shoehorn them into bit strings and run the risk of destroying perfectly good subroutines when you cross them over at a random ...more

36%

In genetic programming, as Koza called his method, we cross over two program trees by randomly swapping two of their subtrees. For example, crossing over these two trees at the highlighted nodes yields the correct program for computing T as one of the children:

37%

We can measure a program’s fitness (or lack thereof) by the distance between its output and the correct one on the training data. For example, if the program says an Earth year is three hundred days, that would subtract sixty-five points from its fitness. Starting with a population of random program trees, genetic programming uses crossover, mutation, and survival to gradually evolve better programs until it’s satisfied.

37%

One consequence of crossing over program trees instead of bit strings is that the resulting programs can have any size, making the learning more flexible. The overall tendency is for bloat, however, with larger and larger trees growing as evolution goes on longer (also known as “survival of the fattest”).

37%

Eliminating sex would leave evolutionaries with only mutation to power their engine. If the size of the population is substantially larger than the number of genes, chances are that every point mutation is represented in it, and the search becomes a type of hill climbing: try all possible one-step variations, pick the best one, and repeat. (Or pick several of the best variations, in which case it’s called beam search.) Symbolists, in particular, use this all the time to learn sets of rules, although they don’t think of it as a form of evolution. To avoid getting trapped in local maxima, hill ...more

38%

Evolutionaries and connectionists have something important in common: they both design learning algorithms inspired by nature. But then they part ways. Evolutionaries focus on learning structure; to them, fine-tuning an evolved structure by optimizing parameters is of secondary importance. In contrast, connectionists prefer to take a simple, hand-coded structure with lots of connections and let weight learning do all the work. This is machine learning’s version of the nature versus nurture controversy, and there are good arguments on both sides.

38%

If you start out blindfolded in Kansas, you have no idea which way the Rockies lie, and you’ll wander around for a long time before you bump into their foothills and start climbing. But if you combine evolution with neural learning, something interesting happens. If you’re on flat ground, but not too far from the foothills, neural learning can get you there, and the closer you are to the foothills, the more likely it will. It’s like being able to scan the horizon: it won’t help you in Wichita, but in Denver you’ll see the Rockies in the distance and head that way. Denver now looks a lot fitter ...more

38%

In biology, this is called the Baldwin effect, after J. M. Baldwin, who proposed it in 1896. In Baldwinian evolution, behaviors that are first learned later become genetically hardwired.

38%

Geoff Hinton and Steven Nowlan demonstrated the Baldwin effect in machine learning by using genetic algorithms to evolve neural network structure and observing that fitness increased over time only when individual learning was allowed.

39%

Evolution takes billions of years to learn, and the brain takes a lifetime. Culture is better: I can distill a lifetime of learning into a book, and you can read it in a few hours. But learning algorithms should be able to learn in minutes or seconds. He who learns fastest wins, whether it’s the Baldwin effect speeding up evolution, verbal communication speeding up human learning, or computers discovering patterns at the speed of light. Machine learning is the latest chapter in the arms race of life on Earth, and swifter hardware is only half the equation. The other half is smarter software.

39%

In contrast to the connectionists and evolutionaries, symbolists and Bayesians do not believe in emulating nature. Rather, they want to figure out from first principles what learners should do—and that includes us humans.

39%

This is an instance of a tension that runs throughout much of science and philosophy: the split between descriptive and normative theories, between “this is how it is” and “this is how it should be.” Symbolists and Bayesians like to point out, however, that figuring out how we should learn can also help us to understand how we do learn because the two are presumably not entirely unrelated—far from it. In particular, behaviors that are important for survival and have had a long time to evolve should not be far from optimal.

39%

At heart, Bayes’ theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down.

39%

For Bayesians, learning is “just” another application of Bayes’ theorem, with whole models as the hypotheses and the data as the evidence: as you see more data, some models become more likely and some less, until ideally one model stands out as the clear winner.

40%

Similarly, Bayesianism as we know it was invented by Pierre-Simon de Laplace, a Frenchman who was born five decades after Bayes. Bayes was the preacher who first described a new way to think about chance, but it was Laplace who codified those insights into the theorem that bears Bayes’s name.

40%

At the heart of his explorations in probability was a preoccupation with Hume’s question. For example, how do we know the sun will rise tomorrow? It has done so every day until today, but that’s no guarantee it will continue. Laplace’s answer had two parts. The first is what we now call the principle of indifference, or principle of insufficient reason.

40%

But, Laplace went on, if the past is any guide to the future, every day that the sun rises should increase our confidence that it will continue to do so. After five thousand years, the probability that the sun will rise yet again tomorrow should be very close to one, but not quite there, since we can never be completely certain. From this thought experiment, Laplace derived his so-called rule of succession, which estimates the probability that the sun will rise again after having risen n times as (n + 1) / (n + 2). When n = 0, this is just ½; and as n increases, so does the probability, ...more

This highlight has been truncated due to consecutive passage length restrictions.

41%

Bayes’ theorem as a foundation for statistics and machine learning is bedeviled not just by computational difficulty but also by extreme controversy.

41%

All models are wrong, but some are useful