Kindle Notes & Highlights
Read between September 22 - October 13, 2018
Evolution is the ultimate example of how much a simple learning algorithm can achieve given enough data.
According to Bayesian statisticians, it’s the only correct way to turn data into knowledge. If they’re right, either Bayes’ theorem is the Master Algorithm or it’s the engine that drives it.
If there’s a limit to what Bayes can learn, we haven’t found it yet.
Doubt is in order when something looks like a silver bullet.
Marvin Minsky, an MIT professor and AI pioneer, is a prominent member of this camp. Minsky is not just skeptical of machine learning as an alternative to knowledge engineering, he’s skeptical of any unifying ideas in AI.
“No matter how smart your algorithm, there are some things it just can’t learn.” Outside of AI and cognitive science, the most common objections to machine learning are variants of this claim. Nassim Taleb hammered on it forcefully in his book The Black Swan. Some events are simply not predictable. If you’ve only ever seen white swans, you think the probability of ever seeing a black one is zero. The financial meltdown of 2008 was a “black swan.”
If I’m the expert on X at company Y, I don’t like to be overridden by some guy with data. There’s a saying in industry: “Listen to your customers, not to the HiPPO,” HiPPO being short for “highest paid person’s opinion.” If you want to be tomorrow’s authority, ride the data, don’t fight it.
The empiricist prefers to try things and see how they turn out.
Replace “god” with “learning algorithm” and “eternal life” with “accurate prediction,” and you have the “no free lunch” theorem. Pick your favorite learner. (We’ll see many in this book.) For every world where it does better than random guessing, I, the devil’s advocate, will deviously construct one where it does worse by the same amount.
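The devil's-advocate argument can be checked by brute force on a tiny world. The following is a minimal sketch, not anything from the book: it assumes three Boolean attributes, an arbitrary split into five training inputs and three unseen ones, and a hypothetical `majority_learner` that always predicts the most common training label. Averaged over every possible labeling of the eight inputs, accuracy on the unseen inputs comes out to exactly one half, the same as random guessing.

```python
from fractions import Fraction
from itertools import product

INPUTS = list(product([0, 1], repeat=3))   # all 8 possible examples
TRAIN, TEST = INPUTS[:5], INPUTS[5:]       # arbitrary split, chosen for illustration

def majority_learner(train_labels):
    """Hypothetical learner: always predict the most common training label."""
    ones = sum(train_labels)
    guess = 1 if 2 * ones >= len(train_labels) else 0
    return lambda x: guess

def average_test_accuracy(learner):
    """Average accuracy on unseen inputs over every possible world (labeling)."""
    worlds = list(product([0, 1], repeat=len(INPUTS)))
    total = Fraction(0)
    for labels in worlds:
        target = dict(zip(INPUTS, labels))
        predict = learner([target[x] for x in TRAIN])
        correct = sum(predict(x) == target[x] for x in TEST)
        total += Fraction(correct, len(TEST))
    return total / len(worlds)

print(average_test_accuracy(majority_learner))
```

Swapping in any other deterministic learner gives the same average, because for each training labeling the unseen labels range over all combinations equally often.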
Don’t give up on machine learning or the Master Algorithm just yet, though. We don’t care about all possible worlds, only the one we live in. If we know something about the world and incorporate it into our learner, it now has an advantage over random guessing.
Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump.
Whenever a learner finds a pattern in the data that is not actually true in the real world, we say that it has overfit the data. Overfitting is the central problem in machine learning. More papers have been written about it than about any other topic.
Thus a good learner is forever walking the narrow path between blindness and hallucination.
Humans are not immune to overfitting, either. You could even say that it’s the root cause of a lot of our evils.
It’s even been said that data mining means “torturing the data until it confesses.”
Overfitting happens when you have too many hypotheses and not enough data to tell them apart.
the number of hypotheses grows exponentially with the number of attributes. Exponential growth is a scary thing.
Bottom line: learning is a race between the amount of data you have and the number of hypotheses you consider.
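To put numbers on that growth, here is a small sketch under two assumptions that are not in the highlights themselves: attributes are Boolean, and the hypothesis space is either conjunctions (each attribute required true, required false, or ignored, giving 3^n) or arbitrary Boolean functions (one output per row of the truth table, giving 2^(2^n)). The function names are illustrative.

```python
def count_conjunctions(n):
    """Conjunctive rules over n Boolean attributes: 3 choices per attribute."""
    return 3 ** n

def count_boolean_functions(n):
    """All Boolean functions of n attributes: 2 outputs for each of 2**n rows."""
    return 2 ** (2 ** n)

for n in (1, 2, 3, 5, 10):
    digits = len(str(count_boolean_functions(n)))
    print(f"n={n}: {count_conjunctions(n)} conjunctions, "
          f"Boolean functions: a {digits}-digit number")
```

Even the restricted space of conjunctions outruns any realistic data set after a few dozen attributes, which is why the race described above is so easy to lose.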
Simple: you don’t believe anything until you’ve verified it on data that the learner didn’t see.
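A minimal sketch of that discipline, using only the standard library. The helper names (`holdout_split`, `accuracy`, `memorizer`) and the coin-flip data are assumptions made for illustration. The labels are random, so there is nothing to learn; a learner that memorizes the training set still scores perfectly on it, and only the held-out examples expose that.

```python
import random

def holdout_split(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data and set aside a fraction the learner never sees."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, examples):
    """Fraction of (input, label) pairs the predictor gets right."""
    return sum(predict(x) == y for x, y in examples) / len(examples)

# Hypothetical data: labels are coin flips, so any apparent pattern is spurious.
rng = random.Random(1)
data = [(i, rng.randint(0, 1)) for i in range(200)]
train, test = holdout_split(data)

memory = dict(train)

def memorizer(x):
    """Hypothetical learner: recall the training label, default to 0 if unseen."""
    return memory.get(x, 0)

train_acc = accuracy(memorizer, train)
test_acc = accuracy(memorizer, test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```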
Even test-set accuracy is not foolproof. According to legend, in an early military application a simple learner detected tanks with 100 percent accuracy in both the training set and the test set, each consisting of one hundred images. Amazing—or suspicious? Turns out all the tank images were lighter than the nontank ones, and that’s all the learner was picking up.
One method is to use statistical significance tests to make sure the patterns we’re seeing are really there.
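One simple instance is an exact one-sided binomial test: if a rule is right on k of n examples, how likely is a result at least that good from coin-flip guessing? This is a sketch under stated assumptions (independent examples, a chance rate of 0.5, and an illustrative function name), not the specific test the book has in mind.

```python
from math import comb

def p_value_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials at chance rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# A rule that is right 9 times out of 10 is hard to explain by luck;
# one that is right 6 times out of 10 is not.
print(p_value_at_least(9, 10))
print(p_value_at_least(6, 10))
```

When many hypotheses are tested on the same data, some will pass such a test by chance alone, so the threshold has to tighten as the number of hypotheses grows.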
Another popular method is to prefer simpler hypotheses. The “divide and conquer” algorithm implicitly prefers simpler rules because it stops adding conditions to a rule as soon as it covers only positive examples and stops adding rules as soon as all positive examples are covered.
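The two stopping conditions in that highlight can be sketched as a small rule learner. The highlight does not say how each condition is chosen, so the greedy choice below (pick the condition with the highest precision on the examples still covered) is an assumption, as are the function names and the toy weather data.

```python
def covers(rule, x):
    """A rule is a dict of attribute -> required value; empty rules cover everything."""
    return all(x[a] == v for a, v in rule.items())

def learn_one_rule(pos, neg):
    """Add conditions until the rule covers only positive examples."""
    rule = {}
    pos_c, neg_c = list(pos), list(neg)
    while neg_c:  # stop as soon as no negatives are covered
        candidates = {(a, x[a]) for x in pos_c for a in x if a not in rule}
        if not candidates:
            break  # inconsistent data: remaining negatives cannot be excluded

        def precision(cond):
            a, v = cond
            p = sum(x[a] == v for x in pos_c)
            n = sum(x[a] == v for x in neg_c)
            return (p / (p + n), p)  # assumed heuristic: precision, then coverage

        a, v = max(sorted(candidates), key=precision)
        rule[a] = v
        pos_c = [x for x in pos_c if x[a] == v]
        neg_c = [x for x in neg_c if x[a] == v]
    return rule

def learn_rules(pos, neg):
    """Add rules until every positive example is covered by some rule."""
    rules, remaining = [], list(pos)
    while remaining:  # stop as soon as all positives are covered
        rule = learn_one_rule(remaining, neg)
        rules.append(rule)
        remaining = [x for x in remaining if not covers(rule, x)]
    return rules

# Hypothetical toy data.
pos = [{"outlook": "sunny", "windy": "no"},
       {"outlook": "overcast", "windy": "no"},
       {"outlook": "overcast", "windy": "yes"}]
neg = [{"outlook": "rainy", "windy": "yes"},
       {"outlook": "sunny", "windy": "yes"}]

rules = learn_rules(pos, neg)
print(rules)
```

Because each loop halts at the first point where its goal is met, the learner never adds a condition or a rule it does not need, which is the implicit bias toward simplicity the highlight describes.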
Simple theories are preferable because they incur a lower cognitive cost (for us) and a lower computational cost (for our algorithms), not because we necessarily expect them to be more accurate.
Most of the time these mutations cause the cell to die silently, but sometimes the cell starts to grow and divide uncontrollably and a cancer is born.
Curing cancer means stopping the bad cells from reproducing without harming the good ones.
Symbolism is the shortest path to the Master Algorithm. It doesn’t require us to figure out how evolution or the brain works, and it avoids the mathematical complexities of Bayesianism. Sets of rules and decision trees are easy to understand, so we know what the learner is up to.
connectionist representations are distributed: each concept is represented by many neurons, and each neuron participates in representing many different concepts. Neurons that excite one another form what Hebb called a cell assembly. Concepts and memories are represented in the brain by cell assemblies. Each of these can include neurons from different brain regions and overlap with other assemblies.
But then the perceptron hit a brick wall. The knowledge engineers were irritated by Rosenblatt’s claims and envious of all the attention and funding neural networks, and perceptrons in particular, were getting. One of them was Marvin Minsky, a former classmate of Rosenblatt’s at the Bronx High School of Science and by then the leader of the AI group at MIT. (Ironically, his PhD had been on neural networks, but he had grown disillusioned with them.) In 1969, Minsky and his colleague Seymour Papert published Perceptrons, a book detailing the shortcomings of the eponymous algorithm, with example
One of the futurist Paul Saffo’s rules of forecasting is: look for the S curves. When you can’t get the temperature in the shower just right—first it’s too cold, and then it quickly shifts to too hot—blame the S curve. When you make popcorn, watch the S curve’s progress: at first nothing happens, then a few kernels pop, then a bunch more, then the bulk of them in a sudden burst of fireworks, then a few more, and then it’s ready to eat. Every motion of your muscles follows an S curve: slow, then fast, then slow again.
Your eyes move in S curves, fixating on one thing and then another, along with your consciousness. Mood swings are phase transitions. So are birth, adolescence, falling in love, getting married, getting pregnant, getting a job, losing it, moving to a new town, getting promoted, retiring, and dying. The universe is a vast symphony of phase transitions, from the cosmic to the microscopic, from the mundane to the life changing.
acceleration does not increase linearly with force, but follows an S curve centered at zero.
Differentiate an S curve and you get a bell curve: slow, fast, slow becomes low, high, low. Add a succession of staggered upward and downward S curves, and you get something close to a sine wave.
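As a worked check, assume the S curve is the logistic function (an assumption; the highlight does not name a specific curve). Its derivative is bell-shaped, though not a Gaussian:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
\qquad\Longrightarrow\qquad
\sigma'(x) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
```

The derivative peaks at $x = 0$, where $\sigma'(0) = \tfrac{1}{4}$, and falls toward zero as $x \to \pm\infty$: slow, fast, slow in the curve becomes low, high, low in its slope.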
Better still, a local minimum may in fact be preferable because it’s less likely to prove to have overfit our data than the global one.
stuck in a local optimum you have to be stuck in every dimension, so it’s more difficult to get stuck in many dimensions than it is in three.
Beware of attaching too much meaning to the weights backprop finds, however. Remember that there are probably many very different ones that are just as good.
A typical investment fund would train a separate network for each of a large number of stocks, let the networks pick the most promising ones, and then have human analysts decide which of those to invest in. A few funds, however, went all the way and let the learners themselves buy and sell. Exactly how all these fared is a closely guarded secret, but it’s probably not an accident that machine learners keep disappearing into hedge funds at an alarming rate.
In truth, connectionists have made genuine progress. One of the protagonists of this latest twist in the connectionist roller coaster is an unassuming little device called an autoencoder.
The nervous system of the C. elegans worm consists of only 302 neurons and was completely mapped in 1986, but we still have only a fragmentary understanding of what it does. We need higher-level concepts to make sense of the morass of low-level details, weeding out the ones that are specific to wetware or just quirks of evolution. We don’t build airplanes by reverse engineering feathers, and airplanes don’t flap their wings. Rather, airplane designs are based on the principles of aerodynamics, which all flying objects must obey. We still do not understand those analogous principles of thought.
Neural networks are not compositional, and compositionality is a big part of human cognition.
No one is sure why sex is pervasive in nature, either. Several theories have been proposed, but none is widely accepted.
In this view, organisms are in a perpetual arms race with parasites, and sex helps keep the population varied, so that no single germ can infect all of it. If this is the answer, then sex is irrelevant to machine learning, at least until learned programs have to vie with computer viruses for processor time and memory.
sex optimizes not fitness but what they call mixability: a gene’s ability to do well on average when combined with other genes. This can be useful when the fitness function is either not known or not constant, as in natural selection, but in machine learning and optimization, hill climbing tends to do better.
Genetic programmers started their own conference, which merged with the genetic algorithms conference to form GECCO, the Genetic and Evolutionary Computing Conference. For its part, the machine-learning mainstream largely forgot them.
The Master Algorithm is neither genetic programming nor backprop, but it has to include the key elements of both: structure learning and weight learning. In the conventional view, nature does its part first—evolving a brain—and then nurture takes it from there, filling the brain with information.
In Baldwinian evolution, behaviors that are first learned later become genetically hardwired. If dog-like mammals can learn to swim, they have a better chance to evolve into seals—as they did—than if they drown.
the goal of machine learning is to find the best possible learning algorithm, by any means available, and evolution and the brain are unlikely to provide it. The products of evolution have many obvious faults.
The molecular biology of living cells is such a mess that molecular biologists often quip that only people who don’t know any of it could believe in intelligent design.
The controversy is in how Bayesians obtain the probabilities that go into it and what those probabilities mean.
A learner that uses Bayes’ theorem and assumes the effects are independent given the cause is called a Naïve Bayes classifier.
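A minimal sketch of such a classifier for categorical attributes. The class name, the add-one smoothing, and the toy symptom data are all assumptions made for illustration; only the independence assumption and the use of Bayes' theorem come from the highlight.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Sketch: pick the class c maximizing P(c) * product of P(x_i | c)."""

    def fit(self, X, y):
        self.class_counts = Counter(y)
        self.n = len(y)
        self.value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
        self.values = defaultdict(set)            # attribute index -> values seen in training
        for x, c in zip(X, y):
            for i, v in enumerate(x):
                self.value_counts[(c, i)][v] += 1
                self.values[i].add(v)
        return self

    def log_posterior(self, x, c):
        # Unnormalized: log P(c) + sum of log P(x_i | c), effects independent given c.
        lp = math.log(self.class_counts[c] / self.n)
        for i, v in enumerate(x):
            count = self.value_counts[(c, i)][v] + 1             # add-one smoothing (assumed)
            total = self.class_counts[c] + len(self.values[i])
            lp += math.log(count / total)
        return lp

    def predict(self, x):
        return max(self.class_counts, key=lambda c: self.log_posterior(x, c))

# Hypothetical data: (fever, cough) -> diagnosis.
X = [("fever", "cough"), ("fever", "no_cough"), ("no_fever", "cough"),
     ("no_fever", "no_cough"), ("fever", "cough")]
y = ["flu", "flu", "cold", "healthy", "flu"]

model = NaiveBayes().fit(X, y)
print(model.predict(("fever", "cough")))
print(model.predict(("no_fever", "no_cough")))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a single unseen attribute value from forcing a class's probability to zero.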
science often progresses by making things as simple as possible, and then some.