An important characteristic of Lasso Regression is that it tends to eliminate the weights of the least important features (i.e., set them to zero).
In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).
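To see this sparsity effect concretely, here is a minimal sketch (not from the book; data and the alpha value are illustrative): only two of ten features actually matter, and Lasso zeroes out most of the rest.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(42)
X = rng.randn(200, 10)                                 # 10 features
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(200)   # only 2 are useful

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)  # most weights come out exactly 0: a sparse model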
The Lasso cost function is not differentiable at θi = 0 (for i = 1, 2, ⋯, n), but Gradient Descent still works fine if you use a subgradient vector g instead when any θi = 0.
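A sketch of that subgradient vector, consistent with the Lasso cost J(θ) = MSE(θ) + α Σi |θi|:

$$ g(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}}\,\mathrm{MSE}(\boldsymbol{\theta}) + \alpha \begin{pmatrix} \operatorname{sign}(\theta_1) \\ \vdots \\ \operatorname{sign}(\theta_n) \end{pmatrix}, \qquad \operatorname{sign}(\theta_i) = \begin{cases} -1 & \text{if } \theta_i < 0 \\ 0 & \text{if } \theta_i = 0 \\ +1 & \text{if } \theta_i > 0 \end{cases} $$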
It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are useful, you should prefer Lasso or Elastic Net because they tend to reduce the useless features’ weights down to zero, as we have discussed.
Cross entropy is frequently used to measure how well a set of estimated class probabilities matches the target classes.
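Concretely, for m instances and K classes the cross-entropy cost can be written (as in the book's Softmax Regression discussion):

$$ J(\boldsymbol{\Theta}) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\!\big(\hat{p}_k^{(i)}\big) $$

where y_k^(i) is 1 if the target class of instance i is k (and 0 otherwise), and p̂_k^(i) is the estimated probability.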
You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes. This is called large margin classification.
This applies regular Stochastic Gradient Descent (see Chapter 4) to train a linear SVM classifier. It does not converge as fast as the LinearSVC class, but it can be useful to handle online classification tasks or huge datasets that do not fit in memory (out-of-core training).
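A minimal sketch of that approach (hyperparameter values are illustrative; the iris features match the book's Chapter 5 examples):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data[:, (2, 3)]               # petal length, petal width
y = (iris.target == 2).astype(int)     # Iris virginica?

# hinge loss makes SGD train a linear SVM; feature scaling matters for SVMs
sgd_svm = make_pipeline(StandardScaler(),
                        SGDClassifier(loss="hinge", alpha=0.01))
sgd_svm.fit(X, y)
print(sgd_svm.predict([[5.5, 1.7]]))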
This algorithm is perfect for complex small or medium-sized training sets. It scales well with the number of features, especially with sparse features (i.e., when each instance has few nonzero features).
To use SVMs for regression instead of classification, the trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street).
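A minimal sketch (synthetic linear data; the epsilon value is illustrative): epsilon sets the width of the street, and instances that land inside it do not affect the model's predictions.

import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.RandomState(42)
X = 2 * rng.rand(100, 1)
y = 4 + 3 * X[:, 0] + rng.randn(100)

svm_reg = LinearSVR(epsilon=1.5)   # wider epsilon -> wider street
svm_reg.fit(X, y)
print(svm_reg.predict([[1.0]]))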
Training a linear SVM classifier means finding the values of w and b that make this margin as wide as possible while avoiding margin violations (hard margin) or limiting them (soft margin).
So we want to minimize ∥ w ∥ to get a large margin. If we also want to avoid any margin violations (hard margin), then we need the decision function to be greater than 1 for all positive training instances and lower than –1 for negative training instances.
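In equation form (with class targets t^(i) = –1 or +1), the hard margin objective is:

$$ \min_{\mathbf{w},\,b}\ \frac{1}{2}\,\mathbf{w}^\top \mathbf{w} \quad \text{subject to} \quad t^{(i)}\big(\mathbf{w}^\top \mathbf{x}^{(i)} + b\big) \ge 1 \quad \text{for } i = 1, 2, \dots, m $$

Minimizing ½ wᵀw is the same as minimizing ∥ w ∥, but it has a nice, simple derivative.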
Given a constrained optimization problem, known as the primal problem, it is possible to express a different but closely related problem, called its dual problem.
The solution to the dual problem typically gives a lower bound to the solution of the primal problem, but under some conditions it can have the same solution as the primal problem.
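For the linear SVM, the dual takes the following standard form (a sketch, using the same targets t^(i) as above):

$$ \min_{\boldsymbol{\alpha}}\ \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha^{(i)}\alpha^{(j)}\, t^{(i)} t^{(j)}\, \mathbf{x}^{(i)\top}\mathbf{x}^{(j)} \;-\; \sum_{i=1}^{m} \alpha^{(i)} \quad \text{subject to } \alpha^{(i)} \ge 0 \text{ for } i = 1, 2, \dots, m $$

Because the training instances appear only through dot products x^(i)ᵀx^(j), this is the form that makes the kernel trick possible.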
As discussed in Chapter 2, you will often use Ensemble methods near the end of a project, once you have already built a few good predictors, to combine them into an even better predictor. In fact, the winning solutions in Machine Learning competitions often involve several Ensemble methods (most famously in the Netflix Prize competition).
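A minimal sketch of the simplest such combination (dataset and hyperparameters illustrative): a hard voting classifier that predicts the majority vote of a few diverse predictors.

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
voting_clf = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC(random_state=42))],
    voting="hard")  # majority vote of the three predictors
voting_clf.fit(X, y)
print(voting_clf.score(X, y))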
In this chapter we will discuss the curse of dimensionality and get a sense of what goes on in high-dimensional space.
Then, we will consider the two main approaches to dimensionality reduction (projection and Manifold Learning), and we will go through three of the most popular dimensionality reduction techniques: PCA, Kernel PCA, and LLE.
This is counterintuitive: how can two points be so far apart when they both lie within the same unit hypercube? Well, there’s just plenty of space in high dimensions. As a result, high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other.
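A quick numerical check (not from the book): sampling random point pairs in a unit hypercube shows the average distance growing with the dimension d.

import numpy as np

rng = np.random.RandomState(42)
for d in (2, 3, 100, 10_000):
    a, b = rng.rand(100, d), rng.rand(100, d)    # 100 random point pairs
    print(d, np.linalg.norm(a - b, axis=1).mean())
# roughly 0.52, 0.66, 4.1, and 40.8: points drift far apart as d grows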
In most real-world problems, training instances are not spread out uniformly across all dimensions.
As a result, all training instances lie within (or close to) a much lower-dimensional subspace of the high-dimensional space. This sounds very abstract, so let’s look at an example. In Figure 8-2 you can see a 3D dataset represented by circles.
Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called Manifold Learning.
It relies on the manifold assumption, also called the manifold hypothesis, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. This assumption is very often empirically observed.
Once again, think about the MNIST dataset: all handwritten digit images have some similarities. They are made of connected lines, the borders are white, and they are more or less centered. If you randomly generated images, only a ridiculously tiny fraction of them would look like handwritten digits.
The manifold assumption is often accompanied by another implicit assumption: that the task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold.
Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space.
As kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help you select the best kernel and hyperparameter values.
That said, dimensionality reduction is often a preparation step for a supervised learning task (e.g., classification), so you can use grid search to select the kernel and hyperparameters that lead to the best performance on that task.
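A minimal sketch of that two-step approach (data and parameter ranges illustrative, in the spirit of the book's Chapter 8 example): kPCA feeds a classifier, and grid search tunes the kernel and gamma by classification accuracy.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_moons(n_samples=200, noise=0.1, random_state=42)
clf = Pipeline([("kpca", KernelPCA(n_components=2)),
                ("log_reg", LogisticRegression())])
param_grid = [{"kpca__gamma": np.linspace(0.03, 0.05, 10),
               "kpca__kernel": ["rbf", "sigmoid"]}]
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)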
Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This point is called the reconstruction pre-image.
Once you have this pre-image, you can measure its squared distance to the original instance. You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.
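A minimal sketch (illustrative data and gamma): with fit_inverse_transform=True, Scikit-Learn's KernelPCA learns the pre-image mapping, and the squared distance back to the original data is the error to minimize.

from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error

X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)   # learn the pre-image map
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)
print(mean_squared_error(X, X_preimage))          # pre-image error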
Siegrid Löwel later summarized Hebb’s idea in the catchy phrase, “Cells that fire together, wire together”; that is, the connection weight between two neurons tends to increase when they fire simultaneously. This rule later became known as Hebb’s rule (or Hebbian learning).
Perceptrons are trained using a variant of this rule that takes into account the error made by the network when it makes a prediction; the Perceptron learning rule reinforces connections that help reduce the error.
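In equation form, the Perceptron learning rule updates each connection weight as follows:

$$ w_{i,j}^{\text{(next step)}} = w_{i,j} + \eta\,\big(y_j - \hat{y}_j\big)\,x_i $$

where w_{i,j} is the weight between input i and output neuron j, x_i is the i-th input value, ŷ_j and y_j are the j-th output and target, and η is the learning rate.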
But in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper that introduced the backpropagation training algorithm, which is still used today.
In short, it is Gradient Descent (introduced in Chapter 4) using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network’s error with regard to every single model parameter.
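A minimal sketch of those two passes using TensorFlow's autodiff (model shape and data are illustrative):

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(30, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1)])
X = tf.random.normal((16, 8))
y = tf.random.normal((16, 1))

with tf.GradientTape() as tape:
    y_pred = model(X)                              # one forward pass
    loss = tf.reduce_mean(tf.square(y - y_pred))   # MSE
grads = tape.gradient(loss, model.trainable_variables)  # one backward pass
print([g.shape for g in grads])  # a gradient for every model parameter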
So if you don’t have some nonlinearity between layers, then even a deep stack of layers is equivalent to a single layer, and you can’t solve very complex problems with that.
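To see why, compose two layers that are purely linear (affine): the result is just another linear layer.

$$ \mathbf{W}_2\big(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1\big) + \mathbf{b}_2 = \big(\mathbf{W}_2\mathbf{W}_1\big)\,\mathbf{x} + \big(\mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2\big) $$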
Conversely, a large enough DNN with nonlinear activations can theoretically approximate any continuous function.
For example, if you have already trained a model to recognize faces in pictures and you now want to train a new neural network to recognize hairstyles, you can kickstart the training by reusing the lower layers of the first network.
This is called transfer learning.
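A minimal Keras sketch of this reuse (the saved model file is hypothetical; details follow the book's Chapter 11 treatment): keep all but the top layer, freeze the reused layers at first, and add a fresh output layer for the new task.

import tensorflow as tf

model_A = tf.keras.models.load_model("my_model_A.h5")   # hypothetical file
model_B_on_A = tf.keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(tf.keras.layers.Dense(1, activation="sigmoid"))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False   # freeze the reused lower layers at first

model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd",
                     metrics=["accuracy"])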
If you do not have much labeled training data, one last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task.
Gathering hundreds of pictures of each person would not be practical.
You could, however, gather a lot of pictures of random people on the web and train a first neural network to detect whether or not two different pictures feature the same person.
Such a network would learn good feature detectors for faces, so reusing its lower layers would allow you to train a good face classifier.
For natural language processing (NLP) applications, you can download a corpus of millions of text documents and automatically generate labeled data from it. For example, you could randomly mask out some words and train a model to predict what the missing words are (e.g., it should predict that the missing word in the sentence “What ___ you saying?” is probably “are” or “were”). If you can train a model to reach good performance on this task, then it will already know quite a lot about language, and you can certainly reuse it for your actual task and fine-tune it on your labeled data.
Google’s TensorFlow Hub provides a way to easily download and reuse pretrained neural networks.
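A minimal sketch (the module URL follows the book's sentence-embedding example; check tfhub.dev for current modules): a pretrained text-embedding layer dropped into a Keras model.

import tensorflow as tf
import tensorflow_hub as hub

model = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                   dtype=tf.string, input_shape=[], output_shape=[50]),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")])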
Indeed, the better the representation, the easier it will be for the neural network to make accurate predictions, so training tends to make embeddings useful representations of the categories. This is called representation learning (we will see other types of representation learning in later chapters).
Let’s look at how we could implement embeddings manually, to understand how they work (then we will use a simple Keras layer instead).
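A minimal sketch of the manual version (vocabulary size and embedding dimension illustrative): a trainable embedding matrix plus a lookup, which is what the Keras Embedding layer wraps up for you.

import tensorflow as tf

vocab_size = 4      # e.g., 4 known categories
embedding_dim = 2
embed_init = tf.random.uniform([vocab_size, embedding_dim])
embedding_matrix = tf.Variable(embed_init)   # trained like any other weight

cat_indices = tf.constant([1, 3, 0])         # category ids to embed
print(tf.nn.embedding_lookup(embedding_matrix, cat_indices))

# The equivalent Keras layer:
# tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)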