Why Machines Learn: The Elegant Math Behind Modern AI
But what exactly is a perceptron, and how does it learn? In its simplest form, a perceptron is an augmented McCulloch-Pitts neuron imbued with a learning algorithm. What follows is an example with two inputs. Note that each input is being multiplied by its corresponding weight. (There is also an extra input, b, the reason for which will soon become clear.)
The perceptron begins by initializing the weight vector to zero and then checks to see if the chosen weight vector correctly classifies each data point, one at a time. This is done by first calculating the value of the expression y(wᵀx) for one data point. If the weights are correct for the data point x and the expression wᵀx evaluates to a negative value, it means that x lies to the left of the hyperplane; it also means that x is classified with the label y = -1. So, if the expected value of y is -1 and the expression wᵀx evaluates to a negative number, their product will be positive. Similarly, …
THE ALGORITHM: THE PERCEPTRON UPDATE RULE (This rule and proof adapted from Weinberger’s lecture.)
Step 1. Initialize the weight vector to zero: set w = 0.
Step 2. For each data point x in the training dataset, do the following:
Step 2a. If y(wᵀx) ≤ 0, the weight vector is wrong, so update it: w_new = w_old + yx.
Step 3. If there were no updates to the weight vector in step 2, terminate; otherwise, go to step 2 and iterate over all the data points once again.
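A minimal sketch of this rule in Python (not the book's code; the extra input b can be folded into the weight vector as a feature fixed at 1):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron update rule: w <- w + y*x whenever y*(w.x) <= 0.

    X: (m, n) array of data points (a column of 1s can encode the bias b).
    y: (m,) array of labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])                 # Step 1: initialize w to zero
    for _ in range(max_epochs):
        updates = 0
        for xi, yi in zip(X, y):             # Step 2: check each data point
            if yi * np.dot(w, xi) <= 0:      # Step 2a: misclassified
                w = w + yi * xi              # update: w_new = w_old + y*x
                updates += 1
        if updates == 0:                     # Step 3: no updates -> done
            break
    return w
```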
Functions like the one depicted above, which have a single, well-defined minimum, are also called convex functions. When we find the bottom of the bowl, technically, we have found the “global” minimum of the function.
The surface is called a hyperbolic paraboloid. Notice that it looks like a saddle: part convex surface and part concave. In the figure above, we start descending from our location and ostensibly reach a place where the gradient is zero. But this is an unstable place. It’s called a saddle point. One false step, and you’ll tumble down the surface. This function has no global or local minimum. Also, the initial starting point can dictate whether you even come close to the saddle point while descending.
Each element of this vector will be an analytic expression that can be calculated using the rules of calculus. Once you have the expressions, you just plug in the current values for the weights, and you get the gradient, which you can then use to calculate the new weights. The problem: You need calculus, and while our gradient has only three elements, in practice it can have elements that number in the tens, hundreds, thousands, or even more. Widrow and Hoff were after something simpler. This is what they came up with: w_new = w_old + μ(-∇_est). Instead of calculating the entire gradient, they …
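A sketch of the Widrow-Hoff idea in Python (an illustration, not the book's code): for a squared-error loss, the gradient estimated from a single sample is proportional to -error × input, so no analytic derivation over thousands of elements is needed.

```python
import numpy as np

def lms_train(X, d, mu=0.01, epochs=50):
    """Widrow-Hoff LMS sketch: estimate the gradient from one sample's
    error instead of deriving the full analytic gradient.

    X: (m, n) inputs; d: (m,) desired outputs; mu: step size
    (the constant factor of 2 in the gradient is absorbed into mu).
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, di in zip(X, d):
            err = di - np.dot(w, xi)    # error for this one sample
            w = w + mu * err * xi       # single-sample gradient step
    return w
```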
“Suppose you are on a game show, and you’re given the choice of three doors. Behind one is a car; behind the others, goats. You pick a door, say, No. 1, and the host, who knows what’s behind the doors, opens another door, say, No. 3, which has a goat. He then says to you, ‘Do you want to pick No. 2?’ Is it to your advantage to switch your choice?” The person playing the game has a quandary. Do they switch their choice from door No. 1 to door No. 2? Is there any benefit to doing so, in that they will increase their odds of choosing the door hiding the car? Before we look at vos Savant’s answer, …
Mathematician Keith Devlin gave another take on it. Put a mental box around your choice, Door No. 1, and another box around Doors No. 2 and 3 combined. The box around Door No. 1 has a one-third probability associated with it, and the box around Doors No. 2 and 3 has a two-thirds probability associated with it, in terms of containing the car. Now the host opens one of the doors inside the bigger box to reveal a goat. The two-thirds probability of the bigger box shifts to the unopened door. To switch is the correct answer.
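A quick simulation (a sketch, not from the book) bears out Devlin's argument: over many trials, switching wins about two-thirds of the time.

```python
import random

def monty_hall(trials=100_000):
    """Simulate the Monty Hall game to compare staying vs. switching."""
    stay_wins = switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)        # door hiding the car
        choice = random.randrange(3)     # player's first pick
        # Host opens a door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != choice and d != car)
        # Switching means taking the one remaining closed door.
        switched = next(d for d in range(3) if d != choice and d != opened)
        stay_wins += (choice == car)
        switch_wins += (switched == car)
    print(f"stay:   {stay_wins / trials:.3f}")    # ~ 1/3
    print(f"switch: {switch_wins / trials:.3f}")  # ~ 2/3

monty_hall()
```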
The theorem allows us to calculate the probability of a hypothesis H (you have the disease) being true, given evidence E (the test is positive). This is written as P(H|E): the probability of H given E. Bayes’s theorem says:

P(H|E) = P(E|H) × P(H) / P(E)

Let’s unpack the various terms on the right-hand side of the equation. P(H): The probability that someone picked at random from the population has the disease. This is also called the prior probability (before taking any evidence into account). In our case, we can assume it is 1/1000, or 0.001, based on what’s been observed in the general population thus far.
P(E|H): The probability of the evidence given the hypothesis or, to put it simply, the probability of testing positive if you have the disease. We know this. It’s the sensitivity of the test: 0.9. P(E): The probability of testing positive. This is the sum of the probabilities of the two different ways someone can test positive, given the background rate of the disease in the population. The first is the prior probability that one has the disease (0.001) multiplied by the probability that one tests positive given the disease (0.9), which equals 0.0009. The second is the prior probability that one doesn’t have the …
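The calculation can be sketched in Python. Note that the excerpt cuts off before giving the test's false-positive rate, so the 0.01 below is an assumed figure for illustration; the structure of the computation is what matters.

```python
# Bayes's theorem: P(H|E) = P(E|H) * P(H) / P(E)
p_h = 0.001          # prior: disease prevalence (from the excerpt)
sensitivity = 0.9    # P(E|H): positive test given disease (from the excerpt)
fp_rate = 0.01       # ASSUMED false-positive rate; the excerpt is truncated here

# P(E): total probability of testing positive, summed over both routes.
p_e = p_h * sensitivity + (1 - p_h) * fp_rate   # 0.0009 + 0.00999

p_h_given_e = sensitivity * p_h / p_e
print(f"P(disease | positive test) = {p_h_given_e:.3f}")  # ~0.083
```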
Now, if you knew the underlying distribution, you could very simply figure out the probability that the person was at risk given x and the probability that the person was not at risk given x (where x refers to the vector for a single person, or an instance of X): P(y = at-risk | x) and P(y = not-at-risk | x). Then, one way to make a prediction would be to choose the category that had the higher probability. Later in the chapter, we’ll come to just how you can do this (it involves using Bayes’s theorem), but for now, all we need to appreciate is that this is the best an ML algorithm can do, …
In the first method, given data, the ML algorithm figures out the best θ, for some choice of distribution type (Bernoulli or Gaussian or something else), which maximizes the likelihood of seeing the data, D. In other words, you are estimating the best underlying distribution, with parameter θ, such that if you were to sample from that distribution, you would maximize the likelihood of observing the labeled data you already had in hand. Not surprisingly, this method is called maximum likelihood estimation (MLE). It maximizes P(D | θ), the probability of observing D given θ, and is loosely …
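As a tiny concrete sketch (not from the book): for a Bernoulli distribution, the θ that maximizes P(D | θ) is simply the fraction of 1s in the data.

```python
import numpy as np

# MLE sketch for a Bernoulli distribution: given observed 0/1 data D,
# the likelihood P(D | theta) = theta^heads * (1 - theta)^tails
# is maximized at theta_hat = heads / total (the sample mean).
D = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
theta_hat = D.mean()
print(f"MLE estimate of theta: {theta_hat}")  # 0.7
```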
…most likely, without having seen the data. This is the prior probability distribution. Bayesian statisticians argue that it’s entirely reasonable to have a prior belief for the value of θ.
This simple classifier that we just analyzed, with only one feature of penguins (bill depth) and two types of penguins, is called the Bayes optimal classifier. It’s the best any ML algorithm can ever do. And in our analysis, the result was contingent upon knowing or estimating the underlying distribution of data.
If you had access to the two underlying distributions, then given a new, unclassified penguin and its bill depth, you could use Bayes’s theorem to simply calculate the probability that the penguin is an Adélie given the bill depth and the probability that the penguin is a Gentoo given the bill depth. Let’s say that, for some given bill depth, the probability of the penguin being an Adélie turns out to be 0.75 and of it being a Gentoo, 0.25. For the Bayes optimal classifier, the higher probability wins each time. The algorithm will always classify the new penguin as an Adélie, even though …
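A sketch of that decision rule in Python, assuming (purely for illustration) Gaussian bill-depth distributions with made-up means, spreads, and priors:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def bayes_optimal(bill_depth, p_adelie=0.5, p_gentoo=0.5):
    """Pick the class with the higher posterior; the shared P(evidence)
    denominator in Bayes's theorem cancels when comparing."""
    post_a = gauss_pdf(bill_depth, mu=18.3, sigma=1.2) * p_adelie  # made-up parameters
    post_g = gauss_pdf(bill_depth, mu=15.0, sigma=1.0) * p_gentoo
    return "Adélie" if post_a > post_g else "Gentoo"

print(bayes_optimal(17.5))   # "Adélie", for this made-up setup
```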
Step 1. Store all instances of sample data. Each penguin is a vector [x1, x2], where x1 = bill depth and x2 = bill length. The entire dataset is stored in a matrix X, where X has m rows (number of penguins) and n columns (number of features). Each penguin is also associated with a label y, which is equal to -1 (Adélie) or 1 (Gentoo). So y, which stores all the corresponding labels, is an m-dimensional vector.
Step 2. Given a new data point, representing an unclassified penguin, in the form of a vector x with elements for bill depth and bill length [x1, x2], do the following: Calculate the …
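A minimal k-NN sketch along these lines (an illustration, not the book's code; the distance-and-vote step is the standard one the truncated text goes on to describe):

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, x_new, k=5):
    """k-nearest-neighbors sketch: there is no training step, only
    the stored data (Step 1) and a lookup at prediction time (Step 2).

    X: (m, 2) matrix of penguins [bill depth, bill length].
    y: (m,) labels, -1 (Adélie) or 1 (Gentoo).
    """
    dists = np.linalg.norm(X - x_new, axis=1)  # distance to every stored point
    nearest = np.argsort(dists)[:k]            # indices of the k closest
    votes = Counter(y[nearest])                # majority vote among them
    return votes.most_common(1)[0][0]
```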
Perhaps the most important feature of the k-NN algorithm is that it’s a so-called nonparametric model. Cast your mind back to the perceptron. Once you have a trained model, using some initial training dataset, the perceptron is simply characterized by its weight vector, w. The number of elements of this vector equals the number of parameters that define the perceptron. This number is not dependent on the amount of training data. You could train the perceptron with one hundred instances of data or a million, but at the end of the training session, the hyperplane would still be defined by w. A …
As Julie Delon of the Université Paris–Descartes says in her talks on the subject, “In high dimensional spaces, nobody can hear you scream.”
We know that in 3D space a large fraction of the volume of the cube is taken up by the enclosed sphere. This fraction starts decreasing as we move up in dimensions. We saw that as the number of dimensions tends to infinity, the volume of the unit hypersphere tends to zero. This means that the internal volume of the hypercube taken up by the unit hypersphere vanishes, most of the volume of the hypercube ends up near the vertices, and all the vertices are equally far away from each other. What’s all this got to do with the k-NN algorithm and machine learning? Well, let’s say that data points …
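That vanishing fraction is easy to verify numerically (a sketch, not from the book), using the closed-form volume of the unit n-ball, pi^(n/2) / Gamma(n/2 + 1):

```python
import math

# Volume of the unit n-ball: pi^(n/2) / Gamma(n/2 + 1).
# The enclosing hypercube with side 2 has volume 2^n, so the
# fraction occupied by the ball collapses as n grows.
for n in [2, 3, 5, 10, 20, 50]:
    ball = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
    frac = ball / 2 ** n
    print(f"n={n:2d}  ball volume={ball:10.4e}  fraction of cube={frac:.2e}")
```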
To recap, a matrix is a rectangular array of numbers. Generically, an m × n matrix has m rows and n columns.
If you look carefully, a matrix-vector multiplication involves taking the dot product of each row of the matrix with the column vector.
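In code (an illustration, not from the book):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])        # 3 x 2 matrix
v = np.array([10, 100])       # 2-dimensional column vector

# Matrix-vector multiplication = dot product of each row with v.
by_rows = np.array([np.dot(row, v) for row in A])
print(by_rows)                # [210 430 650]
print(A @ v)                  # same result via the built-in operator
```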
It’s worthwhile examining another example problem, which John Abel, a postdoc on Brown’s team, often uses to highlight ways in which PCA may be useful. Let’s say we have a dataset of vehicles described by six features: height, length, number of wheels, number of passengers, size, and shape. Each feature corresponds to a dimension along which the vehicle is being analyzed. Most of the variation in this dataset will likely lie along the dimensions that map onto the size and shape of vehicles. If you did principal component analysis on this dataset, the first …
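For a numeric dataset, the analysis itself is compact. Here is a minimal PCA sketch (an illustration, assuming the features have already been encoded as numbers):

```python
import numpy as np

def pca_project(X, k=2):
    """Project data onto its top-k principal components.

    X: (m, n) matrix, one row per vehicle, one column per numeric feature.
    """
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort directions by variance
    return Xc @ eigvecs[:, order[:k]]        # keep the k most-varying directions
```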
Lagrange’s insight was:

∇f(x, y) = λ∇g(x, y)

The gradient of one function is a scalar multiple, λ, of the gradient of the other function.
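A tiny worked example (not from the book): maximize f(x, y) = xy subject to the constraint g(x, y) = x + y - 1 = 0.

```latex
\nabla f = (y,\, x), \qquad \nabla g = (1,\, 1)
\nabla f = \lambda \nabla g \;\Rightarrow\; y = \lambda,\; x = \lambda \;\Rightarrow\; x = y
x + y = 1 \;\Rightarrow\; x = y = \tfrac{1}{2}, \qquad f\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \tfrac{1}{4}
```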
There are many ways to project data into higher-dimensional spaces. For our purposes, such projections come with two major concerns. One has to do with Vapnik’s original algorithm, which requires taking mutual dot products of data samples. Let’s say the original dataset was in ten dimensions. That would require taking dot products of ten-dimensional vectors. If this data is linearly inseparable in 10D space, and if it were to be projected into 1,000 dimensions, where the data cleanly clumped into two separable categories, then each data point would be represented by a 1,000-dimensional vector. …
The other concern has to do with the fact that sometimes one wants to project data into a space that has infinite dimensions. (We’ll soon see how that’s possible.) This has enormous advantages, because in an infinite-dimensional space, you can always find a separating hyperplane. But it isn’t obvious how to compute dot products of vectors of infinite dimensions, let alone store such vectors in computer memory. How, then, do you find the hyperplane?
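This is where kernel functions come in: a kernel returns the dot product the projected vectors would have, without ever constructing them. A sketch (not the book's code) using the degree-2 polynomial kernel:

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

# Explicit degree-2 polynomial projection into 3D:
# phi(a, b) = (a^2, sqrt(2)*a*b, b^2)
def phi(v):
    a, b = v
    return np.array([a * a, np.sqrt(2) * a * b, b * b])

explicit = np.dot(phi(x), phi(z))   # dot product computed in the higher space
kernel = np.dot(x, z) ** 2          # same value, computed entirely in 2D
print(explicit, kernel)             # 121.0 121.0

# The Gaussian (RBF) kernel plays the same game for an infinite-dimensional
# projection, yet costs only one exponential to evaluate:
print(np.exp(-np.linalg.norm(x - z) ** 2))
```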
Also, Hopfield’s work on proofreading in biochemical processes was evidence that dynamical systems that could take multiple pathways through the state space to “converge” to the same final state could reduce the errors that accumulate during computation. Hopfield kept looking for a neurobiological problem that was amenable to such a solution. He finally hit upon one: associative memory. The term may be cryptic to most of us, but it’s something with which we are intuitively familiar. Think about how the strains of a song or the hint of an aroma can bring to mind an entire episode from our …
There’s an interesting analog in magnetism. Certain materials, for example, are ferromagnetic, a state in which the magnetic moments of the material’s atoms (or ions) are all aligned, generating a net magnetism. A ferromagnet is analogous to a solid with a definite crystalline structure. However, if the magnetic moments of the atoms, or ions, are randomly oriented, the material has no permanent magnetism—analogous to the structure of glass. Each individual magnetic moment is the outcome of the spin of an elementary particle in the material. Hence, materials with disordered magnetic moments are …
The Ising model was almost tailor-made to describe the simple neural network he had in mind. By making one more important assumption about how the artificial neurons were connected to each other (and we’ll come to the details), Hopfield could design a network whose dynamics ensured that storing or retrieving the memory was akin to putting the ensemble of neurons, and hence the network, into some stable low-energy state. This state was characterized by the strengths of the connections, or the weights, between the neurons. If you were to read off the outputs of the neurons in this stable state, they would be representative of some memory. Then, if you were to perturb the system by changing some inputs to the neurons, and hence their outputs, this would constitute a partial disruption of memory. If you read off the outputs of neurons now, they would represent distorted memory. But this perturbation would put the system into some high-energy state, and the …
Teaching the perceptron using training data involves finding the correct set of weights for the perceptron’s inputs. However, this algorithm works only for a single-layer perceptron (meaning, you have to provide inputs to a perceptron and read off its output; you cannot feed the output of one perceptron as input to another). Minsky and Papert proved mathematically—and elegantly so—that single-layer perceptrons are ineffective when the data are not linearly separable in some given set of dimensions. They then conjectured that while multi-layer perceptrons, where the output of one layer becomes …
So, finding the Hebbian weights for any stored pattern, or vector, y simply becomes: W = yᵀy - I
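In Python (a sketch; here y is taken as a row vector of ±1 outputs, so yᵀy is the n × n outer product):

```python
import numpy as np

def hebbian_weights(y):
    """Hebbian rule for a Hopfield network: W = y^T y - I.

    y: pattern of +1/-1 values (a row vector). The outer product sets
    w_ij = y_i * y_j; subtracting I zeroes the self-connections.
    """
    y = np.asarray(y, dtype=float)
    return np.outer(y, y) - np.eye(len(y))

W = hebbian_weights([1, -1, 1, 1])
print(W)
```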
It turns out that when the weights of the network have been set using the Hebbian learning rule, then the following are true: In the stable state, which represents a stored memory, the network’s energy (as defined by the equation above) is at a local minimum. The network can have multiple local minima (each potentially representing a different stored memory). In a stable state, neurons don’t flip their outputs any further, and the network remains at that energy minimum. However, if you were to perturb the network, say, by making it store a pattern that’s a slightly corrupted form of a stored memory, this would cause the energy of the network to increase. This perturbed state is unstable, and the neurons will start flipping. It can be shown that when a neuron flips, the overall energy of the network decreases. These dynamics continue until the network reaches a stable state, or a local energy minimum—at which point, the dynamics cease. Once the network reaches an energy minimum, the neurons stop flipping. At this stage, the outputs of the neurons potentially represent some stored memory. Whether or not the stored memory is the one you intended to retrieve depends …
But what if we wanted to store another image in the same network? If we wanted to store only the second image, we’d set the weights to W2, where: W2 = y2ᵀy2 - I. But if you wanted to store both images in the same network, then the composite weight matrix would be: W = W1 + W2. This is the same as: W = y1ᵀy1 + y2ᵀy2 - 2I.
Our algorithm for retrieving an image goes like this:
Step 1. Calculate the energy of the perturbed network.
Step 2. Pick a neuron at random from 1 to 784.
Step 3. Calculate its output based on the outputs of all the other neurons and the weight matrix.
Step 4. Figure out whether the neuron should flip or not. Flip it if necessary.
Step 5. Calculate the new energy.
Step 5a. If (old energy – new energy) <= e, where e is some really small value, then terminate the process.
Step 5b. If (old energy – new energy) > e, then go to step 1 (essentially, iterate over all the neurons at random, over and over, until you reach an energy minimum).
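A runnable sketch of this loop (with assumptions: a 5-neuron network instead of 784, and the standard Hopfield energy E = -½ yᵀWy):

```python
import numpy as np

def retrieve(W, state, tol=1e-9, seed=0):
    """Hopfield retrieval sketch: update neurons until energy stops falling."""
    rng = np.random.default_rng(seed)
    y = np.array(state, dtype=float)
    energy = lambda s: -0.5 * s @ W @ s
    old_e = energy(y)                              # Step 1
    while True:
        for i in rng.permutation(len(y)):          # Steps 2-4: visit neurons randomly
            y[i] = 1.0 if W[i] @ y >= 0 else -1.0  # flip if the weighted input says so
        new_e = energy(y)                          # Step 5
        if old_e - new_e <= tol:                   # Step 5a: at an energy minimum
            return y
        old_e = new_e                              # Step 5b: keep iterating

# Store one pattern via the Hebbian rule, then recover it from a corrupted copy.
pattern = np.array([1, -1, 1, 1, -1])
W = np.outer(pattern, pattern) - np.eye(5)         # W = y^T y - I
noisy = pattern.astype(float)
noisy[0] = -1                                      # corrupt one neuron
print(retrieve(W, noisy))                          # [ 1. -1.  1.  1. -1.]
```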
To recap what we know so far, in the late 1950s and early ’60s, Frank Rosenblatt and Bernard Widrow devised single-layer neural networks and the algorithms to train them, making these networks the focus of machine learning for almost a decade. Then, in 1969, Minsky and Papert published their book, Perceptrons, in which they elegantly proved that single-layer neural networks had limitations, while insinuating (without proof) that multi-layer neural networks would likely be similarly useless, effectively killing that field of research and bringing about the first AI winter.
Rosenblatt recognized this problem of symmetry in neural networks.
The weight and bias vary along the x-axis and y-axis, respectively. The height along the z-axis is the loss for a given weight and bias and for some set of training data. In this case, we have ten pairs of (x, y) points that comprise our training data. For each pair, we can calculate the loss. Then we sum over all the pairs and divide by ten to get the mean squared error (MSE). It’s this value that we plot on the z-axis.
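In code, the z-value at each (weight, bias) point would be computed like this (a sketch, assuming the linear model ŷ = wx + b and illustrative data):

```python
import numpy as np

def mse_loss(w, b, xs, ys):
    """Mean squared error of the linear model y_hat = w*x + b
    over the training pairs (xs, ys)."""
    preds = w * xs + b
    return np.mean((ys - preds) ** 2)    # sum the squared errors, divide by N

# Ten illustrative (x, y) training pairs:
xs = np.arange(10, dtype=float)
ys = 2 * xs + 1                          # data generated with w=2, b=1
print(mse_loss(2.0, 1.0, xs, ys))        # 0.0: the bottom of the surface
print(mse_loss(1.5, 0.0, xs, ys))        # > 0: elsewhere on the surface
```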
Here’s the update rule (it’s called the delta rule because it increments w and b by a small amount, delta): w ← w + Δw and b ← b + Δb
In practice, the deltas are themselves multiplied by a small number called the learning rate, alpha, so that the weights and biases are adjusted by only a small fraction of the gradient.
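Putting the delta rule and the learning rate together, one update step might look like this (a sketch for the linear model above, with the standard MSE gradients):

```python
import numpy as np

def delta_step(w, b, xs, ys, alpha=0.01):
    """One delta-rule update for y_hat = w*x + b with MSE loss:
    move w and b a small fraction (alpha) of the way down the gradient."""
    err = ys - (w * xs + b)            # per-example errors
    grad_w = -2 * np.mean(err * xs)    # d(MSE)/dw
    grad_b = -2 * np.mean(err)         # d(MSE)/db
    return w - alpha * grad_w, b - alpha * grad_b

# Repeated steps walk the (w, b) pair toward the loss minimum:
xs = np.arange(10, dtype=float)
ys = 2 * xs + 1
w, b = 0.0, 0.0
for _ in range(2000):
    w, b = delta_step(w, b, xs, ys)
print(round(w, 2), round(b, 2))        # approaches w=2, b=1
```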
Most important, the function has a derivative (see the coda on this page for the derivation), and this derivative is expressed in terms of the function itself:
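The excerpt omits the formula itself. Assuming, as the surrounding discussion suggests, that the activation is the sigmoid, σ(x) = 1/(1 + e^(-x)), the derivative is σ′(x) = σ(x)(1 - σ(x)). A quick numerical spot-check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # The derivative is expressed in terms of the function itself:
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare against a finite-difference approximation:
x, h = 0.7, 1e-6
approx = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(sigmoid_prime(x), approx)   # both ~0.2217
```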
But what about the weight and bias of the hidden neuron? Well, we continue “backpropagating” the error using the chain rule.
Training eventually comes down to this: Provide the network with some set of inputs, figure out what the expected output should be (either because we humans have annotated the data and know what the output should be or because, in types of learning called self-supervised, the expected output is some known variation of the input itself), calculate the loss, calculate the gradient of the loss, update the weights/biases, rinse and repeat.
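A compact sketch of that whole loop for a toy network with one hidden sigmoid neuron (all data and names here are illustrative, not the book's):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# One input -> one hidden sigmoid neuron -> one sigmoid output.
w1, b1, w2, b2 = rng.normal(size=4)
alpha = 0.5                                  # learning rate

# Illustrative training data: target is 1 for x > 0, else 0.
xs = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
ys = (xs > 0).astype(float)

for _ in range(5000):                        # rinse and repeat
    for x, y in zip(xs, ys):
        h = sigmoid(w1 * x + b1)             # forward pass
        out = sigmoid(w2 * h + b2)
        err = out - y                        # loss = err**2 / 2
        d_out = err * out * (1 - out)        # gradient at the output neuron
        d_h = d_out * w2 * h * (1 - h)       # backpropagate via the chain rule
        w2 -= alpha * d_out * h              # update weights and biases
        b2 -= alpha * d_out
        w1 -= alpha * d_h * x
        b1 -= alpha * d_h

print([round(float(sigmoid(w2 * sigmoid(w1 * x + b1) + b2)), 2) for x in xs])
# outputs move toward [0, 0, 0, 1, 1, 1]
```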
Of course, publishing the paper—it’s barely three pages long—involved laying some groundwork. The trio sent it to the journal Nature. “I did some political work in Britain of going and talking to all the people who might be referees,” Hinton told me.
The kernels were chosen specifically to achieve these highlights. The two kernels are:

-1 -1 -1      -1 0 1
 0  0  0      -1 0 1
 1  1  1      -1 0 1

These are called Prewitt kernels, after their developer. These kernels succeed in generating new images, after the convolution, that detect horizontal and vertical edges. Keep in mind, for now, that these are hand-designed kernels. LeCun wanted his neural network to learn such kernels.
So, for every position the kernel takes atop the image, we assign one neuron. In our example, for a 5×5 image and a 2×2 kernel with a stride of 1, we need 16 such neurons. The outputs of these neurons give us a 4×4 image.
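A sketch of that arithmetic in plain NumPy (not the book's code): each output pixel corresponds to one neuron taking the dot product of the kernel with the patch beneath it.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; each position yields one output
    pixel (one 'neuron' whose weights are the kernel entries)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # dot product with the kernel
    return out

image = np.arange(25).reshape(5, 5)      # a 5x5 "image"
kernel = np.array([[1, 0], [0, -1]])     # a 2x2 kernel (illustrative values)
print(convolve2d(image, kernel).shape)   # (4, 4) -> 16 neurons
```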
Now imagine taking the 4×4 image obtained after the first convolution and applying another convolution using a different 2×2 kernel. The output will be a 3×3 image. This will require 9 neurons. This is the second hidden convolution layer. Each neuron in this layer is the equivalent of a complex cell in Hubel and Wiesel’s hierarchy. Each neuron in this layer is sensitive to the value of 4 pixels in the 4×4 image generated by the previous layer.
If we take just the 3×3 patch of the original image, highlighted in bold lines, here are the connections in a neural network that can transform that patch into a single pixel. First, we lay out the pixels in a straight line for easy visualization and, then, connect these pixels to their respective individual neurons. It’s clear how each neuron in the first hidden layer is only responding to four pixels. Of course, the full layer will have 16 neurons; the illustration shows only 4 of them. The 4 neurons generate 4 pixels of the next 4×4 image. These pixels/outputs become the input to the neuron …
An untrained network will fire willy-nilly. Calculate the error between what’s expected and what the network does, and then use this error to calculate the gradients, via backpropagation. Then update the weights of all the neurons that were involved in turning the input image into an output. The updated weights ensure that for the same input, the network’s error is a tiny bit less than before. Do this for all images, over and over, until the network’s error rate is acceptably low. If we calculate the gradients for all images in the training dataset and update the weights in one go, we are …