480 pages, Hardcover
First published July 16, 2024
Getting under the mathematical skin of machine learning is crucial to our understanding of not just the power of the technology, but also its limitations.
If one of the vectors involved in a dot product is of length 1, then the dot product equals the projection of the other vector onto the vector of unit length.
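A minimal sketch of this identity in plain Python. The helper names (`dot`, `normalize`, `scalar_projection`) and the example vectors are illustrative assumptions, not taken from the book.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v so that its length is 1."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def scalar_projection(a, b):
    """Signed length of the shadow that a casts on the direction of b."""
    return dot(a, b) / math.sqrt(dot(b, b))

a = [3.0, 4.0]
direction = [1.0, 1.0]
u = normalize(direction)  # unit vector along the diagonal

# With a unit vector, the dot product alone already is the projection.
print(dot(a, u), scalar_projection(a, direction))
```

When the second vector is not of unit length, the dot product is the projection scaled by that vector's length, which is why `scalar_projection` divides by the norm.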
The gradient points away from the minimum. So, to go down toward the minimum, you must take a small step in the opposite direction or follow the negative of the gradient.
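The idea can be sketched on a one-dimensional bowl. The function `f(x) = (x - 3)^2`, the learning rate, and the step count below are assumptions chosen for illustration.

```python
def f(x):
    """A simple bowl-shaped function with its minimum at x = 3."""
    return (x - 3.0) ** 2

def grad_f(x):
    """Derivative of f. It is positive to the right of the minimum, negative to the left."""
    return 2.0 * (x - 3.0)

def gradient_descent(x0, learning_rate=0.1, steps=100):
    """Repeatedly take a small step along the negative gradient."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x

x_min = gradient_descent(x0=10.0)
print(x_min, f(x_min))
```

Starting at `x0 = 10.0`, the gradient is positive, so each update subtracts a positive amount and moves `x` left, toward the minimum.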
Experimental, or empirical, probability is somewhat different from theoretical probability. The theoretical probability of getting heads on a single coin toss is simply one-half, but the empirical probability depends upon the outcomes of our actual experiments.
Thanks to something called the square root law, the counts of heads and tails will differ by a value that’s on the order of the square root of the total number of trials.
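A small simulation can make the difference between theoretical and empirical probability concrete. This is a sketch under assumptions: a fair coin, Python's pseudo-random generator, and a fixed seed for reproducibility. The function name `toss_coins` is hypothetical.

```python
import math
import random

def toss_coins(n, seed=0):
    """Simulate n fair coin tosses and return (heads, tails)."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return heads, n - heads

for n in (100, 10_000, 100_000):
    heads, tails = toss_coins(n)
    # Empirical probability approaches 0.5, while the raw gap between
    # heads and tails tends to grow roughly like sqrt(n).
    print(n, heads / n, abs(heads - tails), round(math.sqrt(n)))
```

The fraction of heads settles near one-half as `n` grows, even though the absolute difference between the two counts typically does not shrink.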
in just about every case, it’s impossible to know the underlying distribution. So, the task of probabilistic ML algorithms, one can say, comes down to estimating the distribution from data. Some algorithms do it better than others, and all make mistakes.
Estimating underlying distributions is not trivial. For starters, it’s often easier to make some simplifying assumptions about the shape of the distribution. Is it a Bernoulli distribution? Is it a normal distribution? Keep in mind that these idealized descriptions of distributions are just that: idealized; they make the math easier, but there’s no guarantee that the underlying distribution hews exactly to these mathematical forms.
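As a rough illustration of estimating a distribution under a simplifying assumption, the sketch below draws synthetic data from a known normal distribution and then fits a Gaussian to it by computing the sample mean and standard deviation. The true parameters (mean 5, standard deviation 2), the seed, and the name `fit_gaussian` are assumptions made for the example.

```python
import random
import statistics

# Synthetic data: in practice the true distribution is unknown.
rng = random.Random(42)
samples = [rng.gauss(5.0, 2.0) for _ in range(10_000)]

def fit_gaussian(data):
    """Maximum-likelihood estimates of a Gaussian's mean and standard deviation."""
    return statistics.fmean(data), statistics.pstdev(data)

mean_hat, std_hat = fit_gaussian(samples)
print(mean_hat, std_hat)
```

The estimates land close to the true parameters only because the data really is Gaussian here. Fitting the same two numbers to data with a different shape would still run, but the fitted curve could describe the data poorly.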
Any algorithm that figures out how to separate one cluster of data points from another by identifying a boundary between them is doing discriminative learning.
Of course, the Bayes optimal classifier is an idealization, in that one assumes access to the underlying probability distributions of the data, or our best estimates of such distributions. The NN algorithm functions at the other extreme. All one has is data, and the algorithm makes barely any assumptions about, and indeed has little knowledge of, the underlying distributions. There’s no assumption, for example, that the data follows a Gaussian (bell-shaped) distribution with some mean and variance.
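Assuming "NN" here refers to the nearest neighbor algorithm, a minimal one-neighbor version might look like the sketch below. The toy data set, the labels, and the function name are illustrative assumptions.

```python
import math

def nearest_neighbor(train, query):
    """Return the label of the training point closest to query (Euclidean distance)."""
    _, label = min(train, key=lambda item: math.dist(item[0], query))
    return label

# Each entry is (point, label). No distribution is ever modeled.
train = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]

print(nearest_neighbor(train, (1.2, 1.4)))
print(nearest_neighbor(train, (5.5, 5.0)))
```

The classifier stores the data and compares distances at prediction time. It never estimates a mean, a variance, or any other property of an underlying distribution.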
Multiplying a vector by a matrix can transform the vector, by changing not just its magnitude and orientation, but the very dimensionality of the space it inhabits.
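A short sketch of the dimensionality change: a 2x3 matrix maps a three-dimensional vector to a two-dimensional one. The matrix, the vector, and the helper `matvec` are made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# 2 rows, 3 columns: takes vectors from 3-D space into 2-D space.
M = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, -1.0],
]
v = [1.0, 2.0, 3.0]

w = matvec(M, v)
print(len(v), len(w), w)  # 3 2 [7.0, -1.0]
```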
the eigenvectors won’t be orthogonal when the matrix is not square symmetric
The eigenvectors of a covariance matrix are the principal components of the original matrix X
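The two statements about eigenvectors can be checked on a tiny example. The sketch below computes the covariance matrix of five made-up 2-D points and uses the closed-form eigen-decomposition of a symmetric 2x2 matrix. The data and the function names are assumptions, and a real analysis would normally rely on a linear algebra library instead.

```python
import math

def covariance_2d(points):
    """Sample covariance matrix of a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def eig_symmetric_2x2(m):
    """Eigenpairs of a symmetric 2x2 matrix, largest eigenvalue first."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    if b == 0:  # already diagonal: the eigenvectors are the coordinate axes
        pairs = [(a, (1.0, 0.0)), (c, (0.0, 1.0))]
        return sorted(pairs, key=lambda pair: pair[0], reverse=True)
    center = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    pairs = []
    for eigenvalue in (center + radius, center - radius):
        vx, vy = b, eigenvalue - a  # solves (M - eigenvalue * I) v = 0
        norm = math.hypot(vx, vy)
        pairs.append((eigenvalue, (vx / norm, vy / norm)))
    return pairs

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
(l1, pc1), (l2, pc2) = eig_symmetric_2x2(covariance_2d(points))

# The first principal component follows the diagonal trend in the data,
# and the two components are perpendicular to each other.
print(l1, pc1)
print(l2, pc2)
print(pc1[0] * pc2[0] + pc1[1] * pc2[1])
```

Because a covariance matrix is always square and symmetric, its eigenvectors can be chosen to be orthogonal, which is what makes them usable as a new set of perpendicular axes.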
Hopfield showed that if you have n neurons, the network can store at most 0.14×n memories.
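A toy Hopfield network shows what storing and recalling a memory means. This is a sketch under assumptions: 16 neurons with states of +1 or -1, Hebbian weights, in-order asynchronous updates, and two hand-picked patterns. The function names are hypothetical, and the capacity helper simply applies the 0.14×n figure from the quote.

```python
def approx_capacity(n):
    """Rule-of-thumb number of memories for n neurons, using the 0.14 * n figure."""
    return int(0.14 * n)

def train_hopfield(patterns):
    """Hebbian learning: strengthen weights between neurons that agree."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    weights[i][j] += p[i] * p[j]
    return weights

def recall(weights, state, max_sweeps=10):
    """Update neurons one at a time until the state stops changing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(w * s for w, s in zip(weights[i], state))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state

p1 = [1] * 8 + [-1] * 8
p2 = [1, -1] * 8
weights = train_hopfield([p1, p2])

noisy = list(p1)
noisy[0], noisy[3] = -noisy[0], -noisy[3]  # corrupt two bits of the first memory

print(approx_capacity(16), recall(weights, noisy) == p1)
```

The corrupted input settles back onto the stored pattern, which is the sense in which the network "remembers" it. Storing many more patterns than the capacity estimate allows would be expected to make recall unreliable.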
If a network requires more than two weight matrices (one for the output layer and one for each hidden layer), then it’s called a deep neural network
the deep neural networks that are dominating today’s efforts in AI—with billions, even hundreds of billions of neurons and tens, even hundreds of hidden layers—are challenging the theoretical foundations of machine learning. For one, these networks aren’t as susceptible to the curse of dimensionality as was expected, for reasons that aren’t entirely clear.
There’s a very important and interesting question about whether biological brains do backpropagation. The algorithm is considered biologically implausible, precisely because it needs to store the entire weight matrix used during the forward pass; no one knows how an immensely large biological neural network would keep such weight matrices in memory. It’s very likely that our brains are implementing a different learning algorithm.
in types of learning called self-supervised, the expected output is some known variation of the input itself
the top-5 error rate refers to the percentage of times the correct label for an image does not appear in the top five most likely labels predicted by the ML model
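The definition translates directly into a few lines of code. In this sketch, each prediction is assumed to be a dictionary mapping labels to scores. The labels, scores, and function name are invented for illustration.

```python
def top_k_error(scores_per_image, true_labels, k=5):
    """Percentage of images whose true label is missing from the k highest-scoring labels."""
    misses = 0
    for scores, truth in zip(scores_per_image, true_labels):
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        if truth not in top_k:
            misses += 1
    return 100.0 * misses / len(true_labels)

scores = {"cat": 0.50, "dog": 0.20, "fox": 0.10, "owl": 0.08,
          "bee": 0.06, "ant": 0.04, "elk": 0.02}

predictions = [scores, scores]   # the same model output for two images
truths = ["bee", "elk"]          # "bee" is ranked 5th, "elk" is ranked 7th

print(top_k_error(predictions, truths))        # 50.0
print(top_k_error(predictions, truths, k=1))   # 100.0
```

The first image counts as correct under top-5 scoring even though "bee" was only the fifth guess, which is why top-5 error is always at most the top-1 error.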
Theory of mind is a cognitive ability humans have that allows us to make inferences about someone else’s beliefs or state of mind using only external behavioral cues such as body language and the overall context.
ML algorithms assume that the data on which they have been trained are drawn from some underlying distribution and that the unseen data on which they make predictions are also drawn from the same distribution. If an ML system encounters real-world data that falls afoul of this assumption, all bets are off as to the predictions.
the company Hugging Face, which works on open-source models, calculated that one of its models (a 175-billion-parameter network named BLOOM), during an eighteen-day period, consumed, on average, about 1,664 watts. Compare that to the 20 to 50 watts our brains use, despite our having about 86 billion neurons and about 100 trillion connections, or parameters. On the face of it, it’s no comparison, really.
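The comparison can be turned into quick arithmetic. The sketch below uses only the figures given in the quote. The variable names are illustrative, and the energy total assumes the 1,664-watt average held for the full eighteen days.

```python
BLOOM_AVG_WATTS = 1664.0                      # average power, from the quote
BRAIN_WATTS_LOW, BRAIN_WATTS_HIGH = 20.0, 50.0
PERIOD_DAYS = 18

# How many times more power the model drew than a brain, at both ends of the range.
ratio_vs_high = BLOOM_AVG_WATTS / BRAIN_WATTS_HIGH   # about 33x
ratio_vs_low = BLOOM_AVG_WATTS / BRAIN_WATTS_LOW     # about 83x

# Energy = power x time, converted from watt-hours to kilowatt-hours.
energy_kwh = BLOOM_AVG_WATTS * 24 * PERIOD_DAYS / 1000

print(ratio_vs_high, ratio_vs_low, energy_kwh)
```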