The Master Algorithm Quotes

The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World by Pedro Domingos
6,417 ratings, 3.74 average rating, 594 reviews
The Master Algorithm Quotes Showing 121-150 of 192
“Like the red pill in The Matrix, the Master Algorithm is the gateway to a different reality: the one you already live in but didn’t know it yet. From dating to work, from self-knowledge to the future of society, from data sharing to war, and from the dangers of AI to the next step in evolution, a new world is taking shape, and machine learning is the key that unlocks it.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“Even plain old electricity is a kind of unifier: you can generate it from many different sources—coal, gas, nuclear, hydro, wind, solar—and consume it in an infinite variety of ways. A power station doesn’t know or care how the electricity it produces will be consumed, and your porch light, dishwasher, or brand-new Tesla are oblivious to where their electricity supply comes from. Electricity is the Esperanto of energy. The”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“The breakthrough came in the early 1980s, when Judea Pearl, a professor of computer science at the University of California, Los Angeles, invented a new representation: Bayesian networks. Pearl is one of the most distinguished computer scientists in the world, his methods having swept through machine learning, AI, and many other fields. He won the Turing Award, the Nobel Prize of computer science, in 2012. Pearl realized that it’s OK to have a complex network of dependencies among random variables, provided each variable depends directly on only a few others. We can represent these dependencies with a graph like the ones we saw for Markov chains and HMMs, except now the graph can have any structure (as long as the arrows don’t form closed loops). One of Pearl’s favorite examples is burglar alarms. The alarm at your house should go off if a burglar attempts to break in, but it could also be triggered by an earthquake. (In Los Angeles, where Pearl lives, earthquakes are almost as frequent as burglaries.) If you’re working late one night and your neighbor Bob calls to say he just heard your alarm go off, but your neighbor Claire doesn’t, should you call the police? Here’s the graph of dependencies: If there’s an arrow from one node to another in the graph, we say that the first node is a parent of the second. So Alarm’s parents are Burglary and Earthquake, and Alarm is the sole parent of Bob calls and Claire calls. A Bayesian network is a graph of dependencies like this, together with a table for each variable, giving its probability for each combination of values of its parents. For Burglary and Earthquake we only need one probability each, since they have no parents. For Alarm we need four: the probability that it goes off even if there’s no burglary or earthquake, the probability that it goes off if there’s a burglary and no earthquake, and so on. For Bob calls we need two probabilities (given alarm and given no alarm), and similarly for Claire. 
Here’s the crucial point: Bob calling depends on Burglary and Earthquake, but only through Alarm. Bob’s call is conditionally independent of Burglary and Earthquake given Alarm, and so is Claire’s. If the alarm doesn’t go off, your neighbors sleep soundly, and the burglar proceeds undisturbed. Also, Bob and Claire are independent given Alarm. Without this independence structure, you’d need to learn 2^5 = 32 probabilities, one for each possible state of the five variables. (Or 31, if you’re a stickler for details, since the last one can be left implicit.) With the conditional independencies, all you need is 1 + 1 + 4 + 2 + 2 = 10, a savings of 68 percent. And that’s just in this tiny example; with hundreds or thousands of variables, the savings would be very close to 100 percent.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
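The parameter count in the burglar-alarm passage above can be checked with a short sketch. The names `PARENTS`, `network_parameter_count`, and `full_joint_parameter_count` are illustrative, not from the book, and every variable is assumed to be Boolean.

```python
# Parents of each variable in the burglar-alarm network described in the quote.
PARENTS = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "BobCalls": ["Alarm"],
    "ClaireCalls": ["Alarm"],
}


def network_parameter_count(parents):
    """One probability per combination of parent values (Boolean variables)."""
    return sum(2 ** len(p) for p in parents.values())


def full_joint_parameter_count(n_variables):
    """One probability per joint state of n Boolean variables."""
    return 2 ** n_variables


network_count = network_parameter_count(PARENTS)        # 1 + 1 + 4 + 2 + 2
joint_count = full_joint_parameter_count(len(PARENTS))  # 2^5
savings = 1 - network_count / joint_count               # fraction of parameters saved

print(network_count, joint_count, savings)
```

The computed savings is 0.6875, which the quote reports as 68 percent.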
“HMMs are at the heart of speech-recognition systems like Siri. In speech recognition, the hidden states are written words, the observations are the sounds spoken to Siri, and the goal is to infer the words from the sounds. The model has two components: the probability of the next word given the current one, as in a Markov chain, and the probability of hearing various sounds given the word being pronounced. (How exactly to do the inference is a fascinating problem that we’ll turn to after the next section.) Siri aside, you use an HMM every time you talk on your cell phone. That’s because your words get sent over the air as a stream of bits, and the bits get corrupted in transit. The HMM then figures out the intended bits (hidden state) from the ones received (observations), which it should be able to do as long as not too many bits got mangled. HMMs are also a favorite tool of computational biologists. A protein is a sequence of amino acids, and DNA is a sequence of bases. If we want to predict, for example, how a protein will fold into a 3-D shape, we can treat the amino acids as the observations and the type of fold at each point as the hidden state. Similarly, we can use an HMM to identify the sites in DNA where gene transcription is initiated and many other properties.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
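The passage above leaves the inference step for later. As a hedged illustration, one standard way to recover the most likely hidden sequence is the Viterbi algorithm, which the quote does not name. The sketch below applies it to the noisy-bit example; the function name `viterbi` and all probabilities are made-up assumptions.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][r][0] * trans_p[r][s] * emit_p[s][obs], r) for r in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Hypothetical channel: intended bits tend to repeat, and 10% of bits get flipped.
states = [0, 1]
start_p = {0: 0.5, 1: 0.5}
trans_p = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}
emit_p = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}

received = [0, 0, 1, 0, 0]
decoded = viterbi(received, states, start_p, trans_p, emit_p)
print(decoded)
```

Under these assumed numbers, the lone 1 in the received stream is treated as a corrupted bit, because a single flip is more probable than two state changes in a row.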
“PageRank, the algorithm that gave rise to Google, is itself a Markov chain. Larry Page’s idea was that web pages with many incoming links are probably more important than pages with few, and links from important pages should themselves count for more. This sets up an infinite regress, but we can handle it with a Markov chain. Imagine a web surfer going from page to page by randomly following links: the states of this Markov chain are web pages instead of characters, making it a vastly larger problem, but the math is the same. A page’s score is then the fraction of the time the surfer spends on it, or equivalently, his probability of landing on the page after wandering around for a long time. Markov chains turn up everywhere and are one of the most intensively studied topics in mathematics, but they’re still a very limited kind of probabilistic model. We can go one step further with a model like this: The states form a Markov chain, as before, but we don’t get to see them; we have to infer them from the observations. This is called a hidden Markov model, or HMM for short. (Slightly misleading, because it’s the states that are hidden, not the model.)”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
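The random-surfer idea above can be sketched as repeated redistribution of probability along links. This is a simplified illustration, not Google's implementation: the three-page graph and the name `random_surfer_scores` are hypothetical, every page is assumed to have at least one outgoing link, and the random-jump (damping) term used in practical PageRank is omitted.

```python
def random_surfer_scores(links, steps=100):
    """Approximate the long-run fraction of time a random surfer spends on each page."""
    pages = list(links)
    scores = {page: 1.0 / len(pages) for page in pages}  # start anywhere with equal chance
    for _ in range(steps):
        updated = {page: 0.0 for page in pages}
        for page, outgoing in links.items():
            share = scores[page] / len(outgoing)  # surfer picks an outgoing link uniformly
            for target in outgoing:
                updated[target] += share
        scores = updated
    return scores


# Hypothetical web: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = random_surfer_scores(links)
print(scores)
```

Page C scores as high as A even though only two pages link to it, because one of those links comes from A, which is itself important.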
“models like this are a key component of machine-translation systems, like Google Translate, which lets you see the whole web in English (or almost), regardless of the language the pages were originally written in.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“In 1913, on the eve of World War I, the Russian mathematician Andrei Markov published a paper applying probability to, of all things, poetry. In it, he modeled a classic of Russian literature, Pushkin’s Eugene Onegin, using what we now call a Markov chain. Rather than assume that each letter was generated at random independently of the rest, he introduced a bare minimum of sequential structure: he let the probability of each letter depend on the letter immediately preceding it. He showed that, for example, vowels and consonants tend to alternate, so if you see a consonant, the next letter (ignoring punctuation and white space) is much more likely to be a vowel than it would be if letters were independent. This may not seem like much, but in the days before computers, it required spending hours manually counting characters, and Markov’s idea was quite new. If Vowel_i is a Boolean variable that’s true if the i-th letter of Eugene Onegin is a vowel and false if it’s a consonant, we can represent Markov’s model with a chain-like graph like this, with an arrow between two nodes indicating a direct dependency between the corresponding variables: Markov assumed (wrongly but usefully) that the probabilities are the same at every position in the text. Thus we need to estimate only three probabilities: P(Vowel_1 = True), P(Vowel_{i+1} = True | Vowel_i = True), and P(Vowel_{i+1} = True | Vowel_i = False). (Since probabilities sum to one, from these we can immediately obtain P(Vowel_1 = False), etc.) As with Naïve Bayes, we can have as many variables as we want without the number of probabilities we need to estimate going through the roof, but now the variables actually depend on each other.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
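The three probabilities in Markov's model can be estimated by counting, as in this minimal sketch. It uses English vowels and a toy word rather than Eugene Onegin, and it substitutes the overall vowel frequency for P(Vowel_1 = True), since a single text has only one first letter. The function name and dictionary keys are illustrative.

```python
VOWELS = set("aeiou")  # assumption: English vowels, for a toy example


def estimate_vowel_chain(text):
    """Estimate vowel probabilities, ignoring punctuation and white space."""
    flags = [c in VOWELS for c in text.lower() if c.isalpha()]
    # previous letter is a vowel? -> [number of transitions, transitions into a vowel]
    counts = {True: [0, 0], False: [0, 0]}
    for prev, nxt in zip(flags, flags[1:]):
        counts[prev][0] += 1
        counts[prev][1] += nxt

    def ratio(hits, total):
        return hits / total if total else None

    return {
        "p_vowel": ratio(sum(flags), len(flags)),
        "p_vowel_after_vowel": ratio(counts[True][1], counts[True][0]),
        "p_vowel_after_consonant": ratio(counts[False][1], counts[False][0]),
    }


estimates = estimate_vowel_chain("banana")
print(estimates)
```

In "banana" the letters alternate perfectly, so a vowel always follows a consonant and never follows another vowel, an extreme case of the alternation Markov observed.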
“Peter Norvig, director of research at Google, told me at one point that it was the most widely used learner there, and Google uses machine learning in every nook and cranny of what it does. It’s not hard to see why Naïve Bayes would be popular among Googlers. Surprising accuracy aside, it scales great; learning a Naïve Bayes classifier is just a matter of counting how many times each attribute co-occurs with each class and takes barely longer than reading the data from disk. You could even use Naïve Bayes, tongue-in-cheek, on a much larger scale than Google’s: to model the whole universe.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
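The claim that learning Naïve Bayes is "just a matter of counting" can be illustrated with a small sketch. The toy email data, the attribute names, and the functions `train_naive_bayes` and `predict` are hypothetical. Add-one smoothing for Boolean attributes is an assumption added here so that unseen combinations do not produce zero probabilities; the quote does not mention it.

```python
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count how often each class occurs and how often each attribute value co-occurs with it."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # class -> (attribute, value) -> count
    for attributes, label in examples:
        class_counts[label] += 1
        for name, value in attributes.items():
            feature_counts[label][(name, value)] += 1
    return class_counts, feature_counts


def predict(model, attributes):
    """Pick the class with the highest prior times product of per-attribute likelihoods."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total
        for name, value in attributes.items():
            # Add-one smoothing; the "+ 2" assumes each attribute has two possible values.
            score *= (feature_counts[label][(name, value)] + 1) / (n + 2)
        scores[label] = score
    return max(scores, key=scores.get)


examples = [
    ({"has_link": True, "mentions_prize": True}, "spam"),
    ({"has_link": True, "mentions_prize": False}, "spam"),
    ({"has_link": False, "mentions_prize": True}, "ham"),
    ({"has_link": False, "mentions_prize": False}, "ham"),
]
model = train_naive_bayes(examples)
print(predict(model, {"has_link": True, "mentions_prize": True}))
```

Training makes a single pass over the data and stores only counts, which is why the method scales so easily.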
“In reality, a doctor doesn’t diagnose the flu just based on whether you have a fever; she takes a whole bunch of symptoms into account, including whether you have a cough, a sore throat, a runny nose, a headache, chills, and so on. So what we really need to compute is P(flu | fever, cough, sore throat, runny nose, headache, chills, … ). By Bayes’ theorem, we know that this is proportional to P(fever, cough, sore throat, runny nose, headache, chills, …| flu). But now we run into a problem. How are we supposed to estimate this probability? If each symptom is a Boolean variable (you either have it or you don’t) and the doctor takes n symptoms into account, a patient could have 2^n possible combinations of symptoms.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“Bayes’ theorem is useful because what we usually know is the probability of the effects given the causes, but what we want to know is the probability of the causes given the effects. For example, we know what percentage of flu patients have a fever, but what we really want to know is how likely a patient with a fever is to have the flu. Bayes’ theorem lets us go from one to the other. Its significance extends far beyond that, however. For Bayesians, this innocent-looking formula is the F = ma of machine learning, the foundation from which a vast number of results and applications flow. And whatever the Master Algorithm is, it must be “just” a computational implementation of Bayes’ theorem. I put just in quotes because implementing Bayes’ theorem on a computer turns out to be fiendishly hard for all but the simplest problems, for reasons that we’re about to see.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“Bayes’ theorem says that P(cause | effect) = P(cause) × P(effect | cause) / P(effect). Replace cause by A and effect by B and omit the multiplication sign for brevity, and you get the ten-foot formula in the cathedral.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
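The formula above translates directly into code. In this sketch the function name `posterior` and the flu and fever numbers are invented for illustration; they are not figures from the book.

```python
def posterior(prior_cause, prob_effect_given_cause, prob_effect):
    """Bayes' theorem: P(cause | effect) = P(cause) * P(effect | cause) / P(effect)."""
    return prior_cause * prob_effect_given_cause / prob_effect


# Hypothetical numbers: 5% of patients have the flu, 90% of flu patients
# have a fever, and 10% of all patients have a fever.
p_flu_given_fever = posterior(0.05, 0.9, 0.1)
print(p_flu_given_fever)
```

With these assumed inputs, a fever raises the probability of flu from 5 percent to about 45 percent.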
“At the heart of his explorations in probability was a preoccupation with Hume’s question. For example, how do we know the sun will rise tomorrow? It has done so every day until today, but that’s no guarantee it will continue. Laplace’s answer had two parts. The first is what we now call the principle of indifference, or principle of insufficient reason. We wake up one day—at the beginning of time, let’s say, which for Laplace was five thousand years or so ago—and after a beautiful afternoon, we see the sun go down. Will it come back? We’ve never seen the sun rise, and there is no particular reason to believe it will or won’t. Therefore we should consider the two scenarios equally likely and say that the sun will rise again with a probability of one-half. But, Laplace went on, if the past is any guide to the future, every day that the sun rises should increase our confidence that it will continue to do so. After five thousand years, the probability that the sun will rise yet again tomorrow should be very close to one, but not quite there, since we can never be completely certain. From this thought experiment, Laplace derived his so-called rule of succession, which estimates the probability that the sun will rise again after having risen n times as (n + 1) / (n + 2). When n = 0, this is just ½; and as n increases, so does the probability, approaching 1 when n approaches infinity.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
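Laplace's rule of succession from the passage above is easy to evaluate exactly. The sketch below uses exact fractions; the function name and the choice of 365 days per year are illustrative assumptions.

```python
from fractions import Fraction


def rule_of_succession(n):
    """Probability that the sun rises again after having risen n times: (n + 1) / (n + 2)."""
    return Fraction(n + 1, n + 2)


first_evening = rule_of_succession(0)             # no sunrises seen yet
after_5000_years = rule_of_succession(5000 * 365)  # roughly Laplace's time span

print(first_evening, float(after_5000_years))
```

The result after five thousand years is very close to one but never reaches it, matching the quote's point that certainty is approached and not attained.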
“Bayesianism as we know it was invented by Pierre-Simon de Laplace, a Frenchman who was born five decades after Bayes. Bayes was the preacher who first described a new way to think about chance, but it was Laplace who codified those insights into the theorem that bears Bayes’s name.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“At heart, Bayes’ theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down. For example, if you test positive for AIDS, your probability of having it goes up. Things get more interesting when you have many pieces of evidence, such as the results of multiple tests. To combine them all without suffering a combinatorial explosion, we need to make simplifying assumptions. Things get even more interesting when we consider many hypotheses at once, such as all the different possible diagnoses for a patient. Computing the probability of each disease from the patient’s symptoms in a reasonable amount of time can take a lot of smarts. Once we know how to do all these things, we’ll be ready to learn the Bayesian way. For Bayesians, learning is “just” another application of Bayes’ theorem, with whole models as the hypotheses and the data as the evidence: as you see more data, some models become more likely and some less, until ideally one model stands out as the clear winner.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“For the hardest problems—the ones we really want to solve but haven’t been able to, like curing cancer—pure nature-inspired approaches are probably too uninformed to succeed, even given massive amounts of data. We can in principle learn a complete model of a cell’s metabolic networks by a combination of structure search, with or without crossover, and parameter learning via backpropagation, but there are too many bad local optima to get stuck in. We need to reason with larger chunks, assembling and reassembling them as needed and using inverse deduction to fill in the gaps. And we need our learning to be guided by the goal of optimally diagnosing cancer and finding the best drugs to cure it. Optimal learning is the Bayesians’ central goal, and they are in no doubt that they’ve figured out how to reach it. This way, please …”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“Inside Hod Lipson’s Creative Machines Lab at Cornell University, fantastically shaped robots are learning to crawl and fly, probably even as you read this. One looks like a slithering tower of rubber bricks, another like a helicopter with dragonfly wings, yet another like a shape-shifting Tinkertoy. These robots were not designed by any human engineer but created by evolution, the same process that gave rise to the diversity of life on Earth. Although the robots initially evolve inside a computer simulation, once they look proficient enough to make it in the real world, solid versions are automatically fabricated by 3-D printing. These are not yet ready to take over the world, but they’ve come a long way from the primordial soup of simulated parts they started with. The algorithm that evolved these robots was invented by Charles Darwin in the nineteenth century. He didn’t think of it as an algorithm at the time, partly because a key subroutine was still missing. Once James Watson and Francis Crick provided it in 1953, the stage was set for the second coming of evolution: in silico instead of in vivo, and a billion times faster. Its prophet was a ruddy-faced, perpetually grinning midwesterner by the name of John Holland.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“The S curve is not just important as a model in its own right; it’s also the jack-of-all-trades of mathematics. If you zoom in on its midsection, it approximates a straight line. Many phenomena we think of as linear are in fact S curves, because nothing can grow without limit. Because of relativity, and contra Newton, acceleration does not increase linearly with force, but follows an S curve centered at zero. So does electric current as a function of voltage in the resistors found in electronic circuits, or in a light bulb (until the filament melts, which is itself another phase transition). If you zoom out from an S curve, it approximates a step function, with the output suddenly changing from zero to one at the threshold. So depending on the input voltages, the same curve represents the workings of a transistor in both digital computers and analog devices like amplifiers and radio tuners. The early part of an S curve is effectively an exponential, and near the saturation point it approximates exponential decay. When someone talks about exponential growth, ask yourself: How soon will it turn into an S curve? When will the population bomb peter out, Moore’s law lose steam, or the singularity fail to happen? Differentiate an S curve and you get a bell curve: slow, fast, slow becomes low, high, low. Add a succession of staggered upward and downward S curves, and you get something close to a sine wave. In fact, every function can be closely approximated by a sum of S curves: when the function goes up, you add an S curve; when it goes down, you subtract one. Children’s learning is not a steady improvement but an accumulation of S curves. So is technological change. Squint at the New York City skyline and you can see a sum of S curves unfolding across the horizon, each as sharp as a skyscraper’s corner. Most importantly for us, S curves lead to a new solution to the credit-assignment problem. If the universe is a symphony of phase transitions, let’s model it with one. 
That’s what the brain does: it tunes the system of phase transitions inside to the one outside. So let’s replace the perceptron’s step function with an S curve and see what happens.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“This curve, which looks like an elongated S, is variously known as the logistic, sigmoid, or S curve. Peruse it closely, because it’s the most important curve in the world. At first the output increases slowly with the input, so slowly it seems constant. Then it starts to change faster, then very fast, then slower and slower until it becomes almost constant again. The transfer curve of a transistor, which relates its input and output voltages, is also an S curve. So both computers and the brain are filled with S curves. But it doesn’t end there. The S curve is the shape of phase transitions of all kinds: the probability of an electron flipping its spin as a function of the applied field, the magnetization of iron, the writing of a bit of memory to a hard disk, an ion channel opening in a cell, ice melting, water evaporating, the inflationary expansion of the early universe, punctuated equilibria in evolution, paradigm shifts in science, the spread of new technologies, white flight from multiethnic neighborhoods, rumors, epidemics, revolutions, the fall of empires, and much more. The Tipping Point could equally well (if less appealingly) be entitled The S Curve. An earthquake is a phase transition in the relative position of two adjacent tectonic plates. A bump in the night is just the sound of the microscopic tectonic plates in your house’s walls shifting, so don’t be scared. Joseph Schumpeter said that the economy evolves by cracks and leaps: S curves are the shape of creative destruction. The effect of financial gains and losses on your happiness follows an S curve, so don’t sweat the big stuff. The probability that a random logical formula is satisfiable—the quintessential NP-complete problem—undergoes a phase transition from almost 1 to almost 0 as the formula’s length increases. Statistical physicists spend their lives studying phase transitions.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
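Several claims in the two S-curve passages above can be checked numerically. The sketch below uses the standard logistic function as the S curve; the function names are illustrative, and the inputs are arbitrary sample points.

```python
import math


def sigmoid(x):
    """Logistic S curve, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_slope(x):
    """Derivative of the logistic curve: s(x) * (1 - s(x)), a bell-shaped bump."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Slow, fast, slow becomes low, high, low: the slope peaks at the midpoint.
slopes = [sigmoid_slope(x) for x in (-6, -3, 0, 3, 6)]
print(slopes)

# Zoomed out, the curve looks like a step from 0 to 1 at the threshold.
print(sigmoid(-50), sigmoid(50))
```

Near the midpoint the slope is close to its maximum of 0.25 and changes little, which is why the middle of the curve looks like a straight line.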
“Hebb’s rule, as it has come to be known, is the cornerstone of connectionism. Indeed, the field derives its name from the belief that knowledge is stored in the connections between neurons. Donald Hebb, a Canadian psychologist, stated it this way in his 1949 book The Organization of Behavior: “When an axon of cell A is near enough cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.” It’s often paraphrased as “Neurons that fire together wire together.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“Symbolist machine learning is an offshoot of the knowledge engineering school of AI. In the 1970s, so-called knowledge-based systems scored some impressive successes, and in the 1980s they spread rapidly, but then they died out. The main reason they did was the infamous knowledge acquisition bottleneck: extracting knowledge from experts and encoding it as rules is just too difficult, labor-intensive, and failure-prone to be viable for most problems. Letting the computer automatically learn to, say, diagnose diseases by looking at databases of past patients’ symptoms and the corresponding outcomes turned out to be much easier than endlessly interviewing doctors. Suddenly, the work of pioneers like Ryszard Michalski, Tom Mitchell, and Ross Quinlan had a new relevance, and the field hasn’t stopped growing since. (Another important problem was that knowledge-based systems had trouble dealing with uncertainty, of which more in Chapter 6.)”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“The symbolists’ core belief is that all intelligence can be reduced to manipulating symbols.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“Decision trees instead ensure a priori that each instance will be matched by exactly one rule. This will be the case if each pair of rules differs in at least one attribute test, and such a rule set can be organized into a decision tree. For example, consider these rules: If you’re for cutting taxes and pro-life, you’re a Republican. If you’re against cutting taxes, you’re a Democrat. If you’re for cutting taxes, pro-choice, and against gun control, you’re an independent. If you’re for cutting taxes, pro-choice, and pro-gun control, you’re a Democrat. These can be organized into the following decision tree: A decision tree is like playing a game of twenty questions with an instance. Starting at the root, each node asks about the value of one attribute, and depending on the answer, we follow one or another branch. When we arrive at a leaf, we read off the predicted concept. Each path from the root to a leaf corresponds to a rule. If this reminds you of those annoying phone menus you have to get through when you call customer service, it’s not an accident: a phone menu is a decision tree. The computer on the other end of the line is playing a game of twenty questions with you to figure out what you want, and each menu is a question. According to the decision tree above, you’re either a Republican, a Democrat, or an independent; you can’t be more than one, or none of the above. Sets of concepts with this property are called sets of classes, and the algorithm that predicts them is a classifier. A single concept implicitly defines two classes: the concept itself and its negation. (For example, spam and nonspam.) Classifiers are the most widespread form of machine learning. We can learn decision trees using a variant of the “divide and conquer” algorithm. First we pick an attribute to test at the root. Then we focus on the examples that went down each branch and pick the next test for those. (For example, we check whether tax-cutters are pro-life or pro-choice.) 
We repeat this for each new node we induce until all the examples in a branch have the same class, at which point we label that branch with the class.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
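The four rules in the passage above can be written as nested tests, one per node of a tree. This is one tree consistent with those rules, with taxes at the root as the parenthetical example suggests; the function and parameter names are illustrative.

```python
def classify(for_tax_cuts, pro_life, for_gun_control):
    """Walk the tree from the root; each test corresponds to one node."""
    if not for_tax_cuts:
        return "Democrat"        # against cutting taxes
    if pro_life:
        return "Republican"      # for cutting taxes and pro-life
    if for_gun_control:
        return "Democrat"        # for cutting taxes, pro-choice, pro-gun control
    return "independent"         # for cutting taxes, pro-choice, against gun control


print(classify(for_tax_cuts=True, pro_life=False, for_gun_control=False))
```

Every combination of answers reaches exactly one `return`, which is the property the quote highlights: each instance is matched by exactly one rule.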
“If your learner’s test-set accuracy disappoints, you need to diagnose the problem. Was it blindness or hallucination? In machine learning, the technical terms for these are bias and variance. A clock that’s always an hour late has high bias but low variance. If instead the clock alternates erratically between fast and slow but on average tells the right time, it has high variance but low bias. Suppose you’re down at the pub with some friends, drinking and playing darts. Unbeknownst to them, you’ve been practicing for years, and you’re a master of the game. All your darts go straight to the bull’s-eye. You have low bias and low variance, which is shown in the bottom left corner of this diagram:”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
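The clock analogy above can be made concrete by treating each reading's error as a number. This is a loose sketch: it uses the mean error as a stand-in for bias and the spread of the errors as a stand-in for variance, whereas in machine learning both are defined over learners trained on different data sets. The readings are invented.

```python
from statistics import mean, pvariance


def bias_and_variance(errors_in_minutes):
    """Mean error (bias) and spread of errors around that mean (variance)."""
    return mean(errors_in_minutes), pvariance(errors_in_minutes)


late_clock = [-60, -60, -60, -60]   # always exactly an hour late
erratic_clock = [30, -30, 30, -30]  # fast and slow, but right on average

late_bias, late_variance = bias_and_variance(late_clock)
erratic_bias, erratic_variance = bias_and_variance(erratic_clock)
print(late_bias, late_variance, erratic_bias, erratic_variance)
```

The late clock is consistently wrong, so its error is all bias; the erratic clock is right on average, so its error is all variance.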
“Rather, you take the data you have and randomly divide it into a training set, which you give to the learner, and a test set, which you hide from it and use to verify its accuracy. Accuracy on held-out data is the gold standard in machine learning. You can write a paper about a great new learning algorithm you’ve invented, but if your algorithm is not significantly more accurate than previous ones on held-out data, the paper is not publishable. Accuracy on previously unseen data is a pretty stringent test; so much so, in fact, that a lot of science fails it. That does not make it useless, because science is not just about prediction; it’s also about explanation and understanding. But ultimately, if your models don’t make accurate predictions on new data, you can’t be sure you’ve truly understood or explained the underlying phenomena. And for machine learning, testing on unseen data is indispensable because it’s the only way to tell whether the learner has overfit or not.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
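The random division into training and test sets described above might look like the following sketch. The function name, the 20 percent test fraction, and the fixed seed are illustrative choices, not prescriptions from the book.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then hold out a fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = train_test_split(range(10))
print(len(train_set), len(test_set))
```

The learner should see only `train_set`; accuracy measured on `test_set` is then an estimate of how it will do on previously unseen data.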
“Harvard’s Leslie Valiant received the Turing Award, the Nobel Prize of computer science, for inventing this type of analysis, which he describes in his book entitled, appropriately enough, Probably Approximately Correct.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“Learning is forgetting the details as much as it is remembering the important parts. Computers are the ultimate idiot savants: they can remember everything with no trouble at all, but that’s not what we want them to do. The problem is not limited to memorizing instances wholesale. Whenever a learner finds a pattern in the data that is not actually true in the real world, we say that it has overfit the data. Overfitting is the central problem in machine learning. More papers have been written about it than about any other topic.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“The power of rule sets is a double-edged sword. On the upside, you know you can always find a rule set that perfectly matches the data. But before you start feeling lucky, realize that you’re at severe risk of finding a completely meaningless one. Remember the “no free lunch” theorem: you can’t learn without knowledge. And assuming that the concept can be defined by a set of rules is tantamount to assuming nothing.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“A conjunctive concept is what Tolstoy had in mind when he wrote the opening sentence of Anna Karenina: “All happy families are alike; each unhappy family is unhappy in its own way.” The same is true of individuals. To be happy, you need health, love, friends, money, a job you like, and so on. Take any of these away, and misery ensues.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“In the meantime, the practical consequence of the “no free lunch” theorem is that there’s no such thing as learning without knowledge. Data alone is not enough. Starting from scratch will only get you to scratch. Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
“Are you a rationalist or an empiricist? Rationalists believe that the senses deceive and that logical reasoning is the only sure path to knowledge. Empiricists believe that all reasoning is fallible and that knowledge must come from observation and experimentation. The French are rationalists; the Anglo-Saxons (as the French call them) are empiricists. Pundits, lawyers, and mathematicians are rationalists; journalists, doctors, and scientists are empiricists.”
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World