How did Newton find out or learn the laws of motion and gravitation? First of all, he assumed that hidden "functions" are at play. This is a big assumption, because a function is a specific kind of relation. However, human experience suggests that thinking about nature in terms of functions, that is, relations in which one input cannot map to two or more different outputs, is a fruitful path. Newton probably had a guess about the set of candidate functions that could explain motion or gravitation. Perhaps more challenging was to first identify the features of motion: position/coordinates, speed, acceleration, mass/inertia, force, momentum. He probably used thought experiments to efficiently search for "the best" function, comparing each promising candidate against how well it explained motion or gravitation.
(Supervised) machine learning also tries to find "the best" function (aka model) that explains the relationship between a set of inputs and outputs. What's the use? Like Newton's laws of motion, once the function has been learned or found, it can be used in unknown scenarios. Say, predict if it will rain tomorrow. Or predict if the user will click this ad. Or predict if the user will choose a binge watch from this short list of 20 shows. Or predict that this is an image of a dog. Or translate this Chinese sentence into English.
Machine learning, like Newton, considers a set of candidate functions that an algorithm searches through, comparing them according to some "goodness" criterion. So, in learning there are five ingredients:
1. data: example pairs of (input, output)
2. an assumption (aka inductive bias) about what kind of functions may be candidates to explain the relationship between input and output in (1)
3. a representation of the candidate functions
4. a searching algorithm
5. a goodness criterion that (4) uses to decide how well a candidate function explains the input->output relationship
Say we have a black box with a hidden function inside. We guess the hidden function is one of: addition, subtraction, multiplication, or division.
We have three examples or observations:
black_box(1, 3) = 3, i.e., given 1 and 3 as inputs, the box produces 3 as output.
black_box(2, 2) = 4
black_box(3, 1) = 3
One way to find out is to try all four candidate functions and see which one matches the most outputs; that count of matches is our "goodness" criterion. The goodness of the four candidates across those three examples is:
addition has goodness = 1, i.e., it explains only one example
subtraction has goodness = 0
multiplication has goodness = 3
division has goodness = 1
So we can reasonably say the black box is doing multiplication. That is the learned model/function of the black box.
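To make the five ingredients concrete, here is a minimal Python sketch of this black-box example; the names (`candidates`, `examples`, `goodness`) are just illustrative. The data is the three examples, the inductive bias is the guess that the hidden function is one of four arithmetic operations, the dictionary of Python functions is the representation, the brute-force loop is the search algorithm, and the count of matched outputs is the goodness criterion.

```python
import operator

# (2) + (3): the candidate functions and their representation
candidates = {
    "addition": operator.add,
    "subtraction": operator.sub,
    "multiplication": operator.mul,
    "division": operator.truediv,
}

# (1): the observed (input, output) examples
examples = [((1, 3), 3), ((2, 2), 4), ((3, 1), 3)]

# (5): goodness = how many examples the candidate explains exactly
def goodness(fn, examples):
    return sum(1 for (a, b), out in examples if fn(a, b) == out)

# (4): brute-force search over all candidates
scores = {name: goodness(fn, examples) for name, fn in candidates.items()}
print(scores)                       # {'addition': 1, 'subtraction': 0, 'multiplication': 3, 'division': 1}
print(max(scores, key=scores.get))  # multiplication -- the learned model
```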
There are a few challenges:
Data is noisy, so the learning process should try to discard the noise.
The number of candidate functions can be far larger than the amount of data at hand. So there may not be a single "best" choice given the data; multiple functions may explain the input->output map roughly equally well. To get around this, Newton used his own and others' experience of how motion works to cut down the set of possible functions. This is called inductive bias: Newton is showing a bias towards a special set of functions, and the whole process is ultimately a generalization from specific examples, aka induction.
Deep learning is a subset of machine learning that uses neural networks to represent the candidate functions in (3). It is a very flexible representation: a neural network can represent a very large number of possible functions. Therefore it risks learning noise if data is scanty. In other words, if there are only a few examples, it fails to narrow down the candidate functions and ends up choosing one that explains those few examples well but performs badly on new examples. The good thing is, online businesses generate plenty of data, so neural networks are a good fit nowadays with "big" data. Another good thing about neural networks, or deep learning, is that they can automatically learn which aspects of the data are useful features and which are noise. With other machine learning models this step, called feature engineering, is a very manual, time- and labor-intensive process.
Think of data as a table. Each row is one example. Each column describes one salient feature of those examples. There is usually a special column at the end, which is the output. So if there are 4 columns, the first 3 make up the input domain of the function being learned and the 4th column is the output of that function.
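As a tiny illustration with made-up numbers, here is such a 4-column table of three examples, split into the input features and the output:

```python
# Made-up numbers: each row is one example, the first 3 columns are the
# input features, and the last column is the output to be predicted.
rows = [
    [0.5, 1.2, 3.0, 1],
    [0.9, 0.4, 2.1, 0],
    [1.1, 1.0, 2.8, 1],
]

inputs  = [row[:-1] for row in rows]  # input domain of the function being learned
outputs = [row[-1] for row in rows]   # the function's output for each example
```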
An artificial neural network is inspired by biological neural networks. Like a biological neuron, each artificial neuron is simple: a weighted sum followed by an activation function. The power comes from how the neurons interact in aggregate across the network. From (3), we want a representation that can capture a sufficiently large class of functions from which we can choose a good one that mimics, or learns, the real process we are trying to model.
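Here is a minimal sketch of one such neuron in Python, assuming a sigmoid activation (the particular activation function is my choice for illustration, not something the book fixes):

```python
import math

def neuron(inputs, weights, bias):
    # weighted sum of the inputs, then a sigmoid activation squashing it into (0, 1)
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# one neuron with hypothetical weights applied to one example's features
print(neuron([0.5, 1.2, 3.0], weights=[0.1, -0.4, 0.2], bias=0.05))
```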
In deep learning, to find the best function we look for the minimum of the error surface that lies above the weight space. Moving from one point to another in the weight space amounts to choosing one function over another. Bowl-shaped convex error surfaces are nice because there is a single minimum, so if we follow the local gradient of the error surface we are sure to reach it, even if in a rather circuitous manner. Since deep neural networks use non-linear activation functions in their neurons, the error surface may look more like a bumpy hill, and the function-searcher may get stuck at a local minimum, myopically mistaking it for the global one. There are ways to get around this. A popular function-searching algorithm is gradient descent. Given the difference between the predicted output and the actual output, gradient descent tells us how far and in which direction to move in the weight space to go downhill on the error surface hanging above. This works for a single neuron's error surface. If we know the error of the entire neural network and want to update the weights of its many constituent neurons, we need to somehow assign the blame for the error to the different neurons and then use gradient descent to update each neuron's weights. This blame assignment is done by another algorithm called backpropagation. Basically, both algorithms are fancy names for the chain rule of differentiation.
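As a sketch of the idea, here is a single gradient-descent step for one sigmoid neuron under a squared-error loss; the chain rule shows up as the product of the derivative terms, and backpropagation repeats this bookkeeping across many layers. The function names and the learning rate are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(weights, bias, x, target, learning_rate=0.1):
    # forward pass: weighted sum, then activation
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    predicted = sigmoid(z)
    # chain rule: dE/dw_i = dE/d_predicted * d_predicted/dz * dz/dw_i
    d_error = predicted - target                 # from squared error (1/2)(predicted - target)^2
    d_activation = predicted * (1 - predicted)   # derivative of the sigmoid
    # move each weight a small step downhill on the error surface
    new_weights = [w - learning_rate * d_error * d_activation * xi
                   for w, xi in zip(weights, x)]
    new_bias = bias - learning_rate * d_error * d_activation
    return new_weights, new_bias
```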
In a deep neural network, the neurons are arranged in layers; think of going left to right: the input layer, hidden layer 1, hidden layer 2, and so on, then the output layer. The earlier layers usually learn to detect smaller sub-problems and how to solve them; the later layers build upon those solutions. So deep learning is based on a divide-and-conquer strategy for solving a complicated problem.
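A rough sketch of such a layer-by-layer forward pass, assuming fully connected layers of sigmoid neurons and made-up weights:

```python
import math

def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))        # sigmoid activation

def layer(inputs, neurons):
    # neurons: a list of (weights, bias) pairs, one per neuron in this layer
    return [neuron(inputs, w, b) for w, b in neurons]

def forward(x, network):
    # network: a list of layers ordered input side -> hidden layers -> output;
    # each layer's outputs become the next layer's inputs
    for neurons in network:
        x = layer(x, neurons)
    return x

# e.g. 3 inputs -> a hidden layer of 2 neurons -> an output layer of 1 neuron
net = [
    [([0.1, -0.4, 0.2], 0.0), ([0.3, 0.1, -0.2], 0.1)],  # hidden layer
    [([0.5, -0.6], 0.05)],                                # output layer
]
print(forward([0.5, 1.2, 3.0], net))
```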
The book hints at some exciting problems in deep learning that are being worked on right now: GANs, neuromorphic computing, quantum computing, interpretability. Interpretability in particular looked attractive to me. It is the problem of explaining how a deep neural network came to make a particular decision. This is hard to explain for a deep neural network with many neurons; it is almost like trying to explain an idea in the human mind: what caused the idea to come, and how? There are some avenues folks are taking to solve this problem. One is something like an fMRI/PET-style approach: find the inputs that trigger the decision we are interested in, then dig deeper into how those examples affect the connection weights between neurons and try to explain why the network is coming up with that decision.