Kindle Notes & Highlights
Read between June 24 and September 11, 2025
Getting under the mathematical skin of machine learning is crucial to our understanding of not just the power of the technology, but also its limitations.
If one of the vectors involved in a dot product is of length 1, then the dot product equals the projection of the other vector onto the vector of unit length.
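A quick numerical check of this claim (a sketch in NumPy; the vectors are my own arbitrary choices, not the book’s):

```python
import numpy as np

v = np.array([3.0, 4.0])   # arbitrary vector
u = np.array([1.0, 0.0])   # unit-length vector (|u| = 1)

# Since |u| = 1, the dot product v . u equals the (scalar) projection
# of v onto u's direction.
projection = np.dot(v, u)
print(projection)          # 3.0: the component of v along u
```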
Silvanus P. Thompson, a professor of physics, electrical engineer, and member of the Royal Society, wrote in his classic Calculus Made Easy (first published in 1910) that the “preliminary terror” of symbols in calculus “can be abolished once [and] for all by simply stating what is the meaning—in common-sense terms—of the…principal symbols.”
The gradient points away from the minimum. So, to go down toward the minimum, you must take a small step in the opposite direction or follow the negative of the gradient.
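A minimal sketch of that step, assuming a made-up one-variable function f(x) = x² whose gradient is 2x:

```python
# Minimize f(x) = x**2, whose gradient is f'(x) = 2*x.
x = 5.0                 # arbitrary starting point
learning_rate = 0.1     # the "small step"

for _ in range(50):
    gradient = 2 * x
    x -= learning_rate * gradient   # step opposite to the gradient

print(x)                # approaches 0, the location of the minimum
```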
Experimental, or empirical, probability is somewhat different from theoretical probability. The theoretical probability of getting heads on a single coin toss is simply one-half, but the empirical probability depends upon the outcomes of our actual experiments.
(Thanks to something called the square root law, the counts of heads and tails will differ by a value that’s on the order of the square root of the total number of trials.)
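A coin-toss simulation can illustrate both highlights: the empirical probability hovers near one-half, while the heads-tails gap grows roughly like √n (a sketch; the trial counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [100, 10_000, 1_000_000]:
    tosses = rng.integers(0, 2, size=n)   # 1 = heads, 0 = tails
    heads = tosses.sum()
    tails = n - heads
    print(f"n={n:>9}: empirical P(heads)={heads / n:.4f}, "
          f"|heads - tails|={abs(heads - tails)}, sqrt(n)={n ** 0.5:.0f}")
```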
In just about every case, it’s impossible to know the underlying distribution. So, the task of probabilistic ML algorithms, one can say, comes down to estimating the distribution from data. Some algorithms do it better than others, and all make mistakes.
Estimating underlying distributions is not trivial. For starters, it’s often easier to make some simplifying assumptions about the shape of the distribution. Is it a Bernoulli distribution? Is it a normal distribution? Keep in mind that these idealized descriptions of distributions are just that: idealized; they make the math easier, but there’s no guarantee that the underlying distribution hews exactly to these mathematical forms.
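Under the simplifying Gaussian assumption, “estimating the distribution” reduces to estimating a mean and a standard deviation, as in this sketch (the true parameters 2.0 and 1.5 are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)  # samples from the "unknown" distribution

# Maximum-likelihood estimates under the Gaussian assumption.
mu_hat = data.mean()
sigma_hat = data.std()
print(mu_hat, sigma_hat)   # close to 2.0 and 1.5 -- but only because
                           # the data here really was Gaussian
```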
Any algorithm that figures out how to separate one cluster of data points from another by identifying a boundary between them is doing discriminative learning.
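A perceptron is one classic example of discriminative learning: it learns a linear boundary between two clusters. A minimal sketch on made-up, linearly separable data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two made-up clusters, labeled -1 and +1.
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

w, b = np.zeros(2), 0.0
for _ in range(100):                       # perceptron update rule
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:  # misclassified point
            w += yi * xi
            b += yi

print(w, b)   # defines the separating boundary: w . x + b = 0
```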
Of course, the Bayes optimal classifier is an idealization, in that one assumes access to the underlying probability distributions of the data, or our best estimates of such distributions. The nearest-neighbor (NN) algorithm functions at the other extreme. All one has is data, and the algorithm makes barely any assumptions about, and indeed has little knowledge of, the underlying distributions. There’s no assumption, for example, that the data follows a Gaussian (bell-shaped) distribution with some mean and variance.
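The nearest-neighbor idea itself fits in a few lines: copy the label of the closest training point, with no distributional assumptions anywhere (a sketch; the training points and query are invented):

```python
import numpy as np

def nn_classify(X_train, y_train, x):
    """1-nearest-neighbor: label x with the label of its closest training point."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["A", "A", "B", "B"])
print(nn_classify(X_train, y_train, np.array([4.8, 5.1])))  # "B"
```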
a matrix-vector multiplication involves taking the dot product of each row of the matrix with the column vector.
Multiplying a vector by a matrix can transform the vector, by changing not just its magnitude and orientation, but the very dimensionality of the space it inhabits.
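One sketch covers both of the highlights above: each output entry is the dot product of a matrix row with the vector, and a non-square matrix carries a 3-D vector into 2-D (the numbers are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])   # a 2x3 matrix
v = np.array([1.0, 1.0, 1.0])     # a vector in 3-D

# Row by row: each output entry is the dot product of a row of A with v.
result = np.array([np.dot(row, v) for row in A])
print(result)                  # [3. 4.]
print(A @ v)                   # same thing, via the built-in product
print(v.shape, result.shape)   # (3,) -> (2,): the dimensionality changed
```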
The eigenvectors won’t be orthogonal when the matrix is not square symmetric.
The eigenvectors of a covariance matrix are the principal components of the original matrix X.
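A sketch of that recipe on made-up 2-D data; because the covariance matrix is symmetric, the eigenvectors come out orthogonal, echoing the previous highlight:

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up correlated 2-D data; rows of X are data points.
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)           # center the data

cov = np.cov(X, rowvar=False)    # 2x2 covariance matrix (symmetric)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Columns of `eigenvectors` are the principal components; eigh returns
# eigenvalues in ascending order, so the last column is the top component.
print(eigenvalues)
print(eigenvectors)
```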
Hopfield showed that if you have n neurons, the network can store at most 0.14×n memories.
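A sketch of a Hopfield network using the Hebbian (outer-product) storage rule; with n = 100 neurons, the ~0.14×n bound suggests roughly 14 reliably storable memories, so 5 patterns is comfortably inside it:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100                                        # neurons
patterns = rng.choice([-1, 1], size=(5, n))    # 5 memories, well under 0.14 * n

# Hebbian storage: sum of outer products, zero diagonal.
W = sum(np.outer(p, p) for p in patterns) / n
np.fill_diagonal(W, 0)

# Recall: start from a corrupted memory and update the states repeatedly.
state = patterns[0].copy()
state[:10] *= -1                               # flip 10 bits of the first memory
for _ in range(10):
    state = np.where(W @ state >= 0, 1, -1)

print(np.array_equal(state, patterns[0]))      # very likely True: memory recovered
```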
If a network requires more than two weight matrices (one for the output layer and one for each hidden layer), then it’s called a deep neural network.
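In code, “deep” then just means more than two weight matrices in the chain. A forward-pass sketch with two hidden layers, i.e., three matrices (all shapes invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(5)
# Three weight matrices: input->hidden1, hidden1->hidden2, hidden2->output.
W1 = rng.normal(size=(16, 4))    # first hidden layer
W2 = rng.normal(size=(8, 16))    # second hidden layer
W3 = rng.normal(size=(2, 8))     # output layer

relu = lambda z: np.maximum(z, 0)
x = rng.normal(size=4)           # an input vector
output = W3 @ relu(W2 @ relu(W1 @ x))
print(output)   # three weight matrices => a (minimally) deep network
```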
the deep neural networks that are dominating today’s efforts in AI—with billions, even hundreds of billions, of parameters and tens, even hundreds, of hidden layers—are challenging the theoretical foundations of machine learning. For one, these networks aren’t as susceptible to the curse of dimensionality as was expected, for reasons that aren’t entirely clear.
There’s a very important and interesting question about whether biological brains do backpropagation. The algorithm is considered biologically implausible, precisely because it needs to store the entire set of weight matrices used during the forward pass; no one knows how an immensely large biological neural network would keep such weight matrices in memory. It’s very likely that our brains are implementing a different learning algorithm.
In types of learning called self-supervised, the expected output is some known variation of the input itself.
The top-5 error rate refers to the percentage of times the correct label for an image does not appear in the top five most likely labels predicted by the ML model.
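Computing a top-5 error rate from a model’s raw class scores (a sketch; the score matrix and labels are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(6)
scores = rng.normal(size=(1000, 100))       # 1000 images, 100 classes (stand-in scores)
true_labels = rng.integers(0, 100, size=1000)

# For each image, take the five highest-scoring class indices.
top5 = np.argsort(scores, axis=1)[:, -5:]
hits = np.any(top5 == true_labels[:, None], axis=1)
print(1 - hits.mean())   # fraction of images whose true label missed the top five
```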
Theory of mind is a cognitive ability humans have that allows us to make inferences about someone else’s beliefs or state of mind using only external behavioral cues such as body language and the overall context.
ML algorithms assume that the data on which they have been trained are drawn from some underlying distribution and that the unseen data on which they make predictions are also drawn from the same distribution. If an ML system encounters real-world data that falls afoul of this assumption, all bets are off as to the predictions.
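A toy demonstration of that assumption breaking: a threshold classifier fit on one pair of Gaussians degrades sharply when the test data comes from shifted distributions (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(7)
# Train: class 0 ~ N(0, 1), class 1 ~ N(4, 1). Threshold = midpoint of the means.
x_train0, x_train1 = rng.normal(0, 1, 500), rng.normal(4, 1, 500)
threshold = (x_train0.mean() + x_train1.mean()) / 2

def accuracy(x0, x1):
    """Accuracy of the fixed threshold rule on fresh class-0 and class-1 samples."""
    return ((x0 < threshold).mean() + (x1 >= threshold).mean()) / 2

# Same distribution: high accuracy. Shifted distributions: all bets are off.
print(accuracy(rng.normal(0, 1, 500), rng.normal(4, 1, 500)))  # ~0.98
print(accuracy(rng.normal(3, 1, 500), rng.normal(7, 1, 500)))  # class 0 now misread
```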
the company Hugging Face, which works on open-source models, calculated that one of its models (a 175-billion-parameter network named BLOOM), during an eighteen-day period, consumed, on average, about 1,664 watts. Compare that to the 20 to 50 watts our brains use, despite our having about 86 billion neurons and about 100 trillion connections, or parameters. On the face of it, it’s no comparison, really.

