Why Machines Learn: The Elegant Math Behind Modern AI
Read between June 24 - September 11, 2025
2%
Getting under the mathematical skin of machine learning is crucial to our understanding of not just the power of the technology, but also its limitations.
7%
If one of the vectors involved in a dot product is of length 1, then the dot product equals the projection of the other vector onto the vector of unit length.
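A quick numerical check (a sketch using NumPy; the vectors are made up): the dot product of a vector a with a unit vector u equals the scalar projection of a onto u.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([2.0, 0.0])
u = b / np.linalg.norm(b)            # unit vector in the direction of b

dot = a @ u                          # dot product with a unit-length vector
proj = (a @ b) / np.linalg.norm(b)   # scalar projection of a onto b

print(dot, proj)                     # both 3.0: the dot product equals the projection
```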
13%
Silvanus P. Thompson, a professor of physics, electrical engineer, and member of the Royal Society, wrote in his classic Calculus Made Easy (first published in 1910) that the “preliminary terror” of symbols in calculus “can be abolished once [and] for all by simply stating what is the meaning—in common-sense terms—of the…principal symbols.”
14%
The gradient points away from the minimum. So, to go down toward the minimum, you must take a small step in the opposite direction or, equivalently, follow the negative of the gradient.
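A minimal sketch of that recipe; the function f(x, y) = x^2 + 2y^2 and the step size are my own choices for illustration.

```python
import numpy as np

# Minimize f(x, y) = x**2 + 2*y**2 by gradient descent.
def grad(p):
    x, y = p
    return np.array([2 * x, 4 * y])   # the gradient points uphill

p = np.array([3.0, -2.0])             # arbitrary starting point
step = 0.1                            # small step size

for _ in range(100):
    p = p - step * grad(p)            # move along the *negative* gradient

print(p)                              # very close to the minimum at (0, 0)
```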
19%
Experimental, or empirical, probability is somewhat different from theoretical probability. The theoretical probability of getting heads on a single coin toss is simply one-half, but the empirical probability depends upon the outcomes of our actual experiments.
19%
Thanks to something called the square root law, the counts of heads and tails will differ by a value that’s on the order of the square root of the total number of trials.
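A quick simulation (a sketch; the trial counts are arbitrary) shows both points: the empirical probability of heads drifts toward the theoretical one-half, while the gap between the heads and tails counts grows only on the order of the square root of n.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (100, 10_000, 1_000_000):
    tosses = rng.integers(0, 2, size=n)   # 1 = heads, 0 = tails
    heads = int(tosses.sum())
    tails = n - heads
    print(f"n={n:>9}  empirical P(heads)={heads / n:.4f}  "
          f"|heads-tails|={abs(heads - tails)}  sqrt(n)={int(n ** 0.5)}")
```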
20%
In just about every case, it’s impossible to know the underlying distribution. So, the task of probabilistic ML algorithms, one can say, comes down to estimating the distribution from data. Some algorithms do it better than others, and all make mistakes.
21%
Estimating underlying distributions is not trivial. For starters, it’s often easier to make some simplifying assumptions about the shape of the distribution. Is it a Bernoulli distribution? Is it a normal distribution? Keep in mind that these idealized descriptions of distributions are just that: idealized; they make the math easier, but there’s no guarantee that the underlying distribution hews exactly to these mathematical forms.
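For instance, assuming (as an idealization) that some measurements come from a normal distribution reduces the estimation problem to two numbers, the mean and the variance. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these came from an unknown process; here they really are Gaussian.
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Under the normality assumption, estimating the distribution means
# estimating just its mean and standard deviation from the data.
mu_hat = samples.mean()
sigma_hat = samples.std(ddof=1)
print(f"estimated mean = {mu_hat:.2f}, estimated std = {sigma_hat:.2f}")
# Close to the true 5.0 and 2.0 only because the data really were Gaussian;
# real data need not follow the assumed form.
```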
25%
Any algorithm that figures out how to separate one cluster of data points from another by identifying a boundary between them is doing discriminative learning.
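One classic boundary-finding algorithm is the perceptron; here is a minimal sketch on made-up, linearly separable 2-D clusters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up, linearly separable 2-D clusters with labels -1 and +1.
X = np.vstack([rng.normal([0, 0], 0.4, size=(50, 2)),
               rng.normal([3, 3], 0.4, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

w, b = np.zeros(2), 0.0
for _ in range(20):                    # a few passes over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:     # misclassified: nudge the boundary toward it
            w += yi * xi
            b += yi

print("boundary: w =", w, "b =", b)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```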
28%
Of course, the Bayes optimal classifier is an idealization, in that one assumes access to the underlying probability distributions of the data, or our best estimates of such distributions. The nearest-neighbor (NN) algorithm functions at the other extreme. All one has is data, and the algorithm makes barely any assumptions about, and indeed has little knowledge of, the underlying distributions. There’s no assumption, for example, that the data follows a Gaussian (bell-shaped) distribution with some mean and variance.
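A minimal sketch of that "all one has is data" extreme: a 1-nearest-neighbor classifier on made-up 2-D points, with no distributional assumptions anywhere.

```python
import numpy as np

def nn_predict(X_train, y_train, x):
    """Label x with the label of its single nearest training point (1-NN)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal([0, 0], 0.5, size=(20, 2)),
                     rng.normal([3, 3], 0.5, size=(20, 2))])
y_train = np.array([0] * 20 + [1] * 20)

print(nn_predict(X_train, y_train, np.array([2.5, 2.8])))   # expect 1
```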
31%
A matrix-vector multiplication involves taking the dot product of each row of the matrix with the column vector.
31%
Multiplying a vector by a matrix can transform the vector, by changing not just its magnitude and orientation, but the very dimensionality of the space it inhabits.
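Both points in one small sketch (the numbers are arbitrary): each output entry is the dot product of a matrix row with the vector, and a 2x3 matrix maps a 3-D vector to a 2-D one.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])       # a 2x3 matrix
v = np.array([1.0, 0.0, -1.0])        # a 3-D vector

row_by_row = np.array([A[0] @ v, A[1] @ v])   # dot product of each row with v
print(row_by_row)                             # [-2. -2.]
print(A @ v)                                  # the same matrix-vector product
print(v.shape, "->", (A @ v).shape)           # (3,) -> (2,): the dimensionality changes
```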
31%
The eigenvectors won’t, in general, be orthogonal when the matrix is not square symmetric.
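A quick numerical check with two small made-up matrices: a symmetric one, whose eigenvectors come out orthogonal, and a non-symmetric one, whose eigenvectors don't.

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # symmetric
_, vs = np.linalg.eigh(S)
print(vs[:, 0] @ vs[:, 1])            # ~0: the eigenvectors are orthogonal

M = np.array([[2.0, 1.0],
              [0.0, 3.0]])            # not symmetric
_, vm = np.linalg.eig(M)
print(vm[:, 0] @ vm[:, 1])            # clearly nonzero: not orthogonal
```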
32%
The eigenvectors of a covariance matrix are the principal components of the original matrix X.
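A compact sketch of that recipe on made-up correlated 2-D data: center X, form the covariance matrix, and take its eigenvectors as the principal components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up correlated data: the second column roughly tracks the first.
x = rng.normal(size=500)
X = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)                # center the data
C = np.cov(Xc, rowvar=False)           # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigh, since C is symmetric

# Columns of eigvecs are the principal components; the one with the largest
# eigenvalue points along the direction of greatest variance.
print(eigvecs)
print("fraction of variance along each:", eigvals / eigvals.sum())
```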
44%
Hopfield showed that if you have n neurons, the network can store at most 0.14×n memories.
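A minimal Hopfield-style sketch (Hebbian storage rule, synchronous updates as a simplification, with the number of stored patterns chosen to sit near that rough 0.14 x n limit):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                              # neurons
k = int(0.14 * n)                    # ~14 patterns: the rough capacity quoted above

patterns = rng.choice([-1, 1], size=(k, n))   # random +/-1 memories

W = patterns.T @ patterns / n        # Hebbian storage: sum of outer products
np.fill_diagonal(W, 0)               # no self-connections

def recall(state, steps=10):
    for _ in range(steps):           # synchronous updates (a simplification)
        state = np.sign(W @ state)
        state[state == 0] = 1
    return state

probe = patterns[0].copy()           # corrupt one memory by flipping 10% of its bits
flip = rng.choice(n, size=n // 10, replace=False)
probe[flip] *= -1

print("recovered bits:", int(np.sum(recall(probe) == patterns[0])), "/", n)
```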
46%
If a network requires more than two weight matrices (one for the output layer and one for each hidden layer, that is, more than one hidden layer), then it’s called a deep neural network.
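A sketch of that counting rule: the network below has two hidden layers, hence three weight matrices, so by this criterion it is deep (the layer sizes and the ReLU nonlinearity are my own choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(0.0, x)

# Three weight matrices: two hidden layers plus the output layer.
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # input (4) -> hidden 1 (8)
W2, b2 = rng.normal(size=(6, 8)), np.zeros(6)   # hidden 1 -> hidden 2 (6)
W3, b3 = rng.normal(size=(3, 6)), np.zeros(3)   # hidden 2 -> output (3)

def forward(x):
    h1 = relu(W1 @ x + b1)
    h2 = relu(W2 @ h1 + b2)
    return W3 @ h2 + b3

print(forward(rng.normal(size=4)))
```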
49%
The deep neural networks that are dominating today’s efforts in AI—with billions, even hundreds of billions of parameters and tens, even hundreds of hidden layers—are challenging the theoretical foundations of machine learning. For one, these networks aren’t as susceptible to the curse of dimensionality as was expected, for reasons that aren’t entirely clear.
53%
There’s a very important and interesting question about whether biological brains do backpropagation. The algorithm is considered biologically implausible, precisely because it needs to store the entire weight matrix used during the forward pass; no one knows how an immensely large biological neural network would keep such weight matrices in memory. It’s very likely that our brains are implementing a different learning algorithm.
53%
In types of learning called self-supervised, the expected output is some known variation of the input itself.
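Next-token prediction is one familiar instance: the target sequence is just the input shifted by one position, so the "label" is manufactured from the input itself (the tokens below are made up).

```python
import numpy as np

tokens = np.array([11, 42, 7, 19, 3, 25])   # a made-up token sequence

inputs = tokens[:-1]     # [11, 42, 7, 19, 3]
targets = tokens[1:]     # [42, 7, 19, 3, 25]: a known variation of the input

for x, y in zip(inputs, targets):
    print(f"given {x}, predict {y}")
```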
60%
The top-5 error rate refers to the percentage of times the correct label for an image does not appear in the top five most likely labels predicted by the ML model.
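A small sketch of how that number would be computed; the scores and labels are made up, and the top5_error helper is hypothetical, not from the book.

```python
import numpy as np

def top5_error(scores, labels):
    """Fraction of examples whose true label is not among the 5 highest-scoring classes."""
    top5 = np.argsort(scores, axis=1)[:, -5:]
    misses = [labels[i] not in top5[i] for i in range(len(labels))]
    return float(np.mean(misses))

rng = np.random.default_rng(0)
scores = rng.random((100, 10))          # 100 examples, 10 classes, random scores
labels = rng.integers(0, 10, size=100)  # made-up true labels
print(top5_error(scores, labels))       # with purely random scores, roughly 0.5
```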
66%
Theory of mind is a cognitive ability humans have that allows us to make inferences about someone else’s beliefs or state of mind using only external behavioral cues such as body language and the overall context.
67%
ML algorithms assume that the data on which they have been trained are drawn from some underlying distribution and that the unseen data on which they make predictions are also drawn from the same distribution. If an ML system encounters real-world data that falls afoul of this assumption, all bets are off as to the predictions.
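A toy illustration of what happens when that assumption breaks (a sketch; the distributions and the threshold rule are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
predict = lambda x: (x > 0).astype(int)   # a threshold "learned" from the training setup

# Training distribution: class 0 ~ N(-1, 1), class 1 ~ N(+1, 1).
x_tr = np.concatenate([rng.normal(-1, 1, 1000), rng.normal(+1, 1, 1000)])
y_tr = np.concatenate([np.zeros(1000), np.ones(1000)])
print("accuracy on the training distribution:", np.mean(predict(x_tr) == y_tr))

# Real-world data drawn from a *shifted* distribution: both classes moved right by 2.
x_te = np.concatenate([rng.normal(+1, 1, 1000), rng.normal(+3, 1, 1000)])
y_te = np.concatenate([np.zeros(1000), np.ones(1000)])
print("accuracy under distribution shift:", np.mean(predict(x_te) == y_te))
```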
68%
The company Hugging Face, which works on open-source models, calculated that one of its models (a 175-billion-parameter network named BLOOM), during an eighteen-day period, consumed, on average, about 1,664 watts. Compare that to the 20 to 50 watts our brains use, despite our having about 86 billion neurons and about 100 trillion connections, or parameters. On the face of it, it’s no comparison, really.