The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
3%
Symbolists view learning as the inverse of deduction and take ideas from philosophy, psychology, and logic. Connectionists reverse engineer the brain and are inspired by neuroscience and physics. Evolutionaries simulate evolution on the computer and draw on genetics and evolutionary biology. Bayesians believe learning is a form of probabilistic inference and have their roots in statistics. Analogizers learn by extrapolating from similarity judgments and are influenced by psychology and mathematical optimization.
4%
Believe it or not, every algorithm, no matter how complex, can be reduced to just these three operations: AND, OR, and NOT.
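A minimal sketch of that reduction (the function names are just illustrative): XOR and a one-bit half-adder built from nothing but AND, OR, and NOT.

```python
def AND(a, b): return a and b
def OR(a, b):  return a or b
def NOT(a):    return not a

def XOR(a, b):
    # a XOR b == (a OR b) AND NOT (a AND b)
    return AND(OR(a, b), NOT(AND(a, b)))

def half_adder(a, b):
    # sum bit and carry bit of a + b, using only the three primitives
    return XOR(a, b), AND(a, b)

for a in (False, True):
    for b in (False, True):
        print(a, b, half_adder(a, b))
```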
5%
Scientists make theories, and engineers make devices. Computer scientists make algorithms, which are both theories and devices.
5%
Every algorithm has an input and an output: the data goes into the computer, the algorithm does what it will with it, and out comes the result. Machine learning turns this around: in goes the data and the desired result and out comes the algorithm that turns one into the other.
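A toy sketch of that inversion, assuming numpy is available: the ordinary program applies a known rule to data, while the learner recovers the rule (here just a slope and an intercept) from input-output pairs.

```python
import numpy as np

# Ordinary program: the rule is given, data goes in, results come out.
def fahrenheit(celsius):
    return 1.8 * celsius + 32

# Machine learning: (input, desired output) pairs go in,
# and the rule -- a slope and an intercept -- comes out.
celsius = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
fahr    = np.array([32.0, 50.0, 68.0, 86.0, 104.0])

slope, intercept = np.polyfit(celsius, fahr, deg=1)  # fit a line to the pairs
print(slope, intercept)  # ~1.8 and ~32: the "algorithm" was learned from data
```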
6%
Machine-learning experts (aka machine learners) are an elite priesthood even among computer scientists. Many computer scientists, particularly those of an older generation, don’t understand machine learning as well as they’d like to. This is because computer science has traditionally been all about thinking deterministically, but machine learning requires thinking statistically. If a rule for, say, labeling e-mails as spam is 99 percent accurate, that does not mean it’s buggy; it may be the best you can do and good enough to be useful. This difference in thinking is a large part of why ...more
12%
The Master Algorithm is for induction, the process of learning, what the Turing machine is for deduction.
20%
If all conjunctions of two factors fail, you can try all conjunctions of any number of factors. Machine learners and psychologists call these “conjunctive concepts.”
20%
Dictionary definitions are conjunctive concepts: a chair has a seat and a back and some number of legs. Remove any of these and it’s no longer a chair.
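A minimal sketch of learning such a conjunctive concept: keep only the attribute tests that every positive example satisfies. The "chair" attributes below are made up for illustration.

```python
positives = [
    {"has_seat": True, "has_back": True, "has_legs": True, "color": "red"},
    {"has_seat": True, "has_back": True, "has_legs": True, "color": "blue"},
]

# Start with every test from the first positive example, then drop any
# test that a later positive example contradicts.
concept = dict(positives[0])
for example in positives[1:]:
    concept = {k: v for k, v in concept.items() if example.get(k) == v}

print(concept)  # {'has_seat': True, 'has_back': True, 'has_legs': True}
```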
22%
Overfitting happens when you have too many hypotheses and not enough data to tell them apart.
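A quick illustration, assuming numpy: on ten noisy points drawn from a straight line, a degree-6 polynomial has far more hypotheses at its disposal than the data can pin down, so it chases the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# The true rule is a straight line; the data adds a little noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test + rng.normal(scale=0.2, size=x_test.size)

for degree in (1, 6):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train error {train_err:.3f}, test error {test_err:.3f}")

# The degree-6 fit usually scores better on the 10 training points and
# worse on fresh points: too many hypotheses, too little data.
```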
24%
Was it blindness or hallucination? In machine learning, the technical terms for these are bias and variance.
24%
A clock that’s always an hour late has high bias but low variance. If instead the clock alternates erratically between fast and slow but on average tells the right time, it has high variance but low bias.
24%
If it keeps making the same mistakes, the problem is bias, and you need a more flexible learner (or just a different one). If there’s no pattern to the mistakes, the problem is variance, and you want to either try a less flexible learner or get more data.
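A numeric sketch of the clock analogy, with made-up readings: the slow clock's errors share a pattern (bias), while the erratic clock's errors have no pattern but a large spread (variance).

```python
import numpy as np

rng = np.random.default_rng(1)
true_time = 12.0  # noon, in hours

# Clock A: always about an hour late -- high bias, low variance.
clock_a = true_time - 1.0 + rng.normal(scale=0.01, size=1000)

# Clock B: erratic, but right on average -- low bias, high variance.
clock_b = true_time + rng.normal(scale=1.0, size=1000)

for name, readings in [("A (slow)", clock_a), ("B (erratic)", clock_b)]:
    bias = readings.mean() - true_time
    variance = readings.var()
    print(f"clock {name}: bias {bias:+.2f} h, variance {variance:.2f} h^2")
```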
39%
…powerful than you’d guess from its everyday uses. At heart, Bayes’ theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down.
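For reference, the rule itself in symbols (standard textbook form, not quoted from the book):

```latex
P(\text{hypothesis} \mid \text{evidence})
  \;=\;
  \frac{P(\text{evidence} \mid \text{hypothesis})\, P(\text{hypothesis})}
       {P(\text{evidence})}
```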
41%
A learner that uses Bayes’ theorem and assumes the effects are independent given the cause is called a Naïve Bayes classifier.
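A minimal Naive Bayes sketch for spam filtering; the tiny corpus and the add-one smoothing constant are illustrative choices, not from the book.

```python
import math
from collections import Counter

train = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule today", "ham"),
    ("project meeting notes", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # log P(class) + sum over words of log P(word | class):
        # words are assumed independent given the class (the "naive" part),
        # with add-one smoothing so unseen words don't zero out the product.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) /
                              (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("cheap money meeting"))  # -> 'spam' on this toy corpus
```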
41%
But machine learning is the art of making false assumptions and getting away with it.
41%
As the statistician George Box famously put it: “All models are wrong, but some are useful.”
42%
The states form a Markov chain, as before, but we don’t get to see them; we have to infer them from the observations. This is called a hidden Markov model, or HMM for short.
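A minimal HMM sketch with made-up numbers: hidden weather states, observed umbrella sightings, and the forward algorithm summing over all the hidden paths.

```python
import numpy as np

states = ["rainy", "sunny"]
start = np.array([0.5, 0.5])                 # P(first hidden state)
trans = np.array([[0.7, 0.3],                # P(next state | current state)
                  [0.3, 0.7]])
emit = np.array([[0.9, 0.1],                 # P(observation | state);
                 [0.2, 0.8]])                # columns: umbrella, no umbrella

observations = [0, 0, 1]                     # umbrella, umbrella, no umbrella

# Forward algorithm: accumulate P(observations so far, current state).
alpha = start * emit[:, observations[0]]
for obs in observations[1:]:
    alpha = (alpha @ trans) * emit[:, obs]

print("P(observation sequence) =", alpha.sum())
print("P(last state | observations) =", alpha / alpha.sum())
```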
43%
Pearl realized that it’s OK to have a complex network of dependencies among random variables, provided each variable depends directly on only a few others. We can represent these dependencies with a graph like the ones we saw for Markov chains and HMMs, except now the graph can have any structure (as long as the arrows don’t form closed loops).
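A small sketch of such a network, using the classic rain / sprinkler / wet-grass graph with made-up numbers: because each variable depends directly only on its parents, the joint probability is just a product of small local tables.

```python
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: 0.1, False: 0.9}          # independent of rain in this toy graph
P_wet = {  # P(wet grass | rain, sprinkler)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.0,
}

def joint(rain, sprinkler, wet):
    # The joint factorizes along the graph: P(rain) * P(sprinkler) * P(wet | parents).
    p_wet = P_wet[(rain, sprinkler)] if wet else 1 - P_wet[(rain, sprinkler)]
    return P_rain[rain] * P_sprinkler[sprinkler] * p_wet

# P(rain | grass is wet), summing the joint over the unobserved sprinkler.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print("P(rain | wet grass) =", num / den)
```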
46%
A Markov network is a set of features and corresponding weights, which together define a probability distribution.
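A minimal sketch over three binary variables, with illustrative features and weights: exponentiate the weighted sum of the features that fire, then normalize by Z to get a distribution.

```python
import itertools, math

features = [
    (lambda x: x[0] == x[1], 2.0),   # variables 0 and 1 tend to agree
    (lambda x: x[1] == x[2], 1.0),   # variables 1 and 2 tend to agree
]

def score(x):
    # Unnormalized probability: exp(weighted sum of the features that fire).
    return math.exp(sum(w for f, w in features if f(x)))

states = list(itertools.product([0, 1], repeat=3))
Z = sum(score(x) for x in states)    # the normalizer (partition function)

for x in states:
    print(x, round(score(x) / Z, 3))
```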
50%
In fact, no learner is immune to the curse of dimensionality. It’s the second worst problem in machine learning, after overfitting. The term curse of dimensionality was coined by Richard Bellman, a control theorist, in the fifties. He observed that control algorithms that worked fine in three dimensions became hopelessly inefficient in higher-dimensional spaces, such as when you want to control every joint in a robot arm or every knob in a chemical plant. But in machine learning the problem is more than just computational cost—it’s that learning itself becomes harder and harder as the ...more
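One quick way to see the statistical side of the curse, assuming numpy: for random points in a unit hypercube, the nearest and farthest neighbors of a query end up almost equally far away as dimensions are added, so similarity stops carrying information.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))       # 500 random points in the unit cube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"{dim:>4} dims: nearest {dists.min():.2f}, "
          f"farthest {dists.max():.2f}, ratio {dists.max() / dists.min():.2f}")
# The nearest/farthest ratio shrinks toward 1 as the dimension grows.
```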
57%
This direction—known as the first principal component of the data—is also the direction along which the spread of the data is greatest.
57%
Principal-component analysis (PCA), as this process is known, is one of the key tools in the scientist’s toolkit.
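A minimal PCA sketch, assuming numpy: center the data, take the top right-singular vector of the centered matrix as the first principal component, and project onto it. The toy data (points scattered around a line) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

t = rng.normal(size=200)
data = np.column_stack([t, 0.5 * t + rng.normal(scale=0.1, size=200)])

centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

first_pc = vt[0]                   # direction of greatest spread
projection = centered @ first_pc   # each 2-D point reduced to one coordinate

print("first principal component:", first_pc)
print("fraction of variance captured:",
      projection.var() / centered.var(axis=0).sum())
```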
57%
Applying PCA to congressional votes and poll data shows that, contrary to popular belief, politics is not mainly about liberals versus conservatives. Rather, people differ along two main dimensions: one for economic issues and one for social ones. Collapsing these into a single axis mixes together populists and libertarians, who are polar opposites, and creates the illusion of lots of moderates in the middle. Trying to appeal to them is an unlikely winning strategy. On the other hand, if liberals and libertarians overcame their mutual aversion, they could ally themselves on social issues, ...more
57%
One of the most popular algorithms for nonlinear dimensionality reduction, called Isomap, does just this. It connects each data point in a high-dimensional space (a face, say) to all nearby points (very similar faces), computes the shortest distances between all pairs of points along the resulting network and finds the reduced coordinates that best approximate these distances.
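A short sketch of that procedure using scikit-learn's Isomap on the standard Swiss-roll dataset (assuming scikit-learn is installed); the neighbor count is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)   # a 3-D "rolled up" sheet

# Connect each point to its nearest neighbors, use shortest-path distances
# along that graph, and find 2-D coordinates that best preserve them.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)   # (1000, 2): the sheet unrolled into its two real dimensions
```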
58%
Edward Thorndike called this the law of effect: actions that lead to pleasure are more likely to be repeated in the future; actions that lead to pain, less so.
65%
P is a probability, w is a vector of weights (notice it’s in boldface), n is a vector of numbers, and their dot product • is exponentiated and divided by Z, the sum of all products. If we let the first component of n be one if the first feature of the image is true and zero otherwise, and so on, w•n is just a shorthand for the weighted sum of features we’ve been talking about all along.
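Written out, the formula the passage describes (my reconstruction from the prose, with the sum in Z running over all possible states):

```latex
P \;=\; \frac{e^{\,\mathbf{w}\cdot\mathbf{n}}}{Z},
\qquad
Z \;=\; \sum_{\text{states}} e^{\,\mathbf{w}\cdot\mathbf{n}}
```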
75%
People worry that computers will get too smart and take over the world, but the real problem is that they’re too stupid and they’ve already taken over the world.
77%
Of these, the closest in content to this book is, not coincidentally, the one I teach (www.coursera.org/course/machlearning). Two other options are Andrew Ng’s course (www.coursera.org/course/ml) and Yaser Abu-Mostafa’s (http://work.caltech.edu/telecourse.html). The next step is to read a textbook. The closest to this book, and one of the most accessible, is Tom Mitchell’s Machine Learning* (McGraw-Hill, 1997). More up-to-date, but also more mathematical, are Kevin Murphy’s Machine Learning: A Probabilistic Perspective* (MIT Press, 2012), Chris Bishop’s Pattern Recognition and Machine Learning* ...more
Note (Beau D Lyddon): resources
77%
A current one is Artificial Intelligence: A Modern Approach, by Stuart Russell and Peter Norvig (3rd ed., Prentice Hall, 2010).