Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
15%
An important characteristic of Lasso Regression is that it tends to eliminate the weights of the least important features (i.e., set them to zero).
15%
In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).
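A minimal Scikit-Learn sketch of this sparsity effect (the toy data and the alpha value below are made up for illustration, not from the book):

    import numpy as np
    from sklearn.linear_model import Lasso

    # Toy data: only the first two of ten features actually matter.
    rng = np.random.RandomState(42)
    X = rng.randn(200, 10)
    y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.randn(200)

    lasso_reg = Lasso(alpha=0.1)
    lasso_reg.fit(X, y)
    print(lasso_reg.coef_)  # most of the ten weights come out exactly 0 -> a sparse model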
16%
The Lasso cost function is not differentiable at θi = 0 (for i = 1, 2, ⋯, n), but Gradient Descent still works fine if you use a subgradient vector g instead when any θi = 0.
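For reference, the Lasso cost function and one subgradient vector that works with it, written in standard notation (a sketch consistent with the quote above):

    J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} |\theta_i|

    g(\theta) = \nabla_{\theta}\,\mathrm{MSE}(\theta)
              + \alpha \bigl(\operatorname{sign}(\theta_1), \dots, \operatorname{sign}(\theta_n)\bigr)^{\top},
    \qquad
    \operatorname{sign}(\theta_i) =
    \begin{cases} -1 & \text{if } \theta_i < 0 \\ 0 & \text{if } \theta_i = 0 \\ +1 & \text{if } \theta_i > 0 \end{cases}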
16%
It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are useful, you should prefer Lasso or Elastic Net because they tend to reduce the useless features’ weights down to zero, as we have discussed.
17%
Cross entropy is frequently used to measure how well a set of estimated class probabilities matches the target classes.
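As a concrete illustration (a small NumPy sketch with made-up numbers, not code from the book), the cross entropy of a batch is the average of −log p̂ of the true class:

    import numpy as np

    # Estimated class probabilities for 3 instances over 3 classes.
    probas = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4]])
    targets = np.array([0, 1, 2])  # true class index of each instance

    # -mean(log of the probability assigned to the true class) ~= 0.50 here;
    # it shrinks toward 0 as the estimated probabilities match the targets better.
    cross_entropy = -np.mean(np.log(probas[np.arange(len(targets)), targets]))
    print(cross_entropy)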
17%
You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes. This is called large margin classification.
17%
This applies regular Stochastic Gradient Descent (see Chapter 4) to train a linear SVM classifier. It does not converge as fast as the LinearSVC class, but it can be useful to handle online classification tasks or huge datasets that do not fit in memory (out-of-core training).
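A minimal sketch of that approach (the toy data and hyperparameter values are placeholders; m is the number of training instances and C the usual SVM hyperparameter):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
    y_train = np.array([0, 0, 1, 1])

    m, C = len(X_train), 1.0
    sgd_clf = SGDClassifier(loss="hinge", alpha=1 / (m * C), random_state=42)  # hinge loss -> linear SVM
    sgd_clf.fit(X_train, y_train)       # batch training...
    # sgd_clf.partial_fit(X_batch, y_batch, classes=[0, 1])  # ...or incremental, for out-of-core learning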
18%
This algorithm is perfect for complex small or medium-sized training sets. It scales well with the number of features, especially with sparse features (i.e., when each instance has few nonzero features).
18%
To use SVMs for regression instead of classification, the trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street).
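A minimal regression sketch along these lines (toy data; epsilon, which controls the width of the street, is just an example value):

    import numpy as np
    from sklearn.svm import LinearSVR

    rng = np.random.RandomState(42)
    X = 2 * rng.rand(100, 1)
    y = 4 + 3 * X[:, 0] + rng.randn(100)

    svm_reg = LinearSVR(epsilon=1.5)  # larger epsilon -> wider street
    svm_reg.fit(X, y)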
18%
Training a linear SVM classifier means finding the values of w and b that make this margin as wide as possible while avoiding margin violations (hard margin) or limiting them (soft margin).
18%
So we want to minimize ∥ w ∥ to get a large margin. If we also want to avoid any margin violations (hard margin), then we need the decision function to be greater than 1 for all positive training instances and lower than –1 for negative training instances.
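In equations (a sketch in standard SVM notation, with t(i) = ±1 as the class target, ζ(i) the slack variables, and C the regularization hyperparameter):

    \text{Hard margin:}\quad
    \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\,\mathbf{w}^{\top}\mathbf{w}
    \quad\text{s.t.}\quad t^{(i)}\bigl(\mathbf{w}^{\top}\mathbf{x}^{(i)} + b\bigr) \ge 1
    \quad\text{for } i = 1,\dots,m

    \text{Soft margin:}\quad
    \min_{\mathbf{w},\,b,\,\zeta}\ \tfrac{1}{2}\,\mathbf{w}^{\top}\mathbf{w} + C\sum_{i=1}^{m}\zeta^{(i)}
    \quad\text{s.t.}\quad t^{(i)}\bigl(\mathbf{w}^{\top}\mathbf{x}^{(i)} + b\bigr) \ge 1 - \zeta^{(i)},\ \ \zeta^{(i)} \ge 0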
18%
Given a constrained optimization problem, known as the primal problem, it is possible to express a different but closely related problem, called its dual problem.
18%
The solution to the dual problem typically gives a lower bound to the solution of the primal problem, but under some conditions it can have the same solution as the primal problem.
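For the linear SVM the dual takes this form (a sketch; once the α(i) are found, the weights can be recovered as w = Σ α(i) t(i) x(i)):

    \min_{\alpha}\ \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}
    \alpha^{(i)}\alpha^{(j)}\,t^{(i)}t^{(j)}\,\mathbf{x}^{(i)\top}\mathbf{x}^{(j)}
    \;-\;\sum_{i=1}^{m}\alpha^{(i)}
    \qquad\text{s.t.}\quad \alpha^{(i)} \ge 0 \ \text{ for } i = 1,\dots,m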
20%
As discussed in Chapter 2, you will often use Ensemble methods near the end of a project, once you have already built a few good predictors, to combine them into an even better predictor. In fact, the winning solutions in Machine Learning competitions often involve several Ensemble methods (most famously in the Netflix Prize competition).
23%
In this chapter we will discuss the curse of dimensionality and get a sense of what goes on in high-dimensional space.
23%
Then, we will consider the two main approaches to dimensionality reduction (projection and Manifold Learning), and we will go through three of the most popular dimensionality reduction techniques: PCA, Kernel PCA, and LLE.
23%
This is counterintuitive: how can two points be so far apart when they both lie within the same unit hypercube? Well, there’s just plenty of space in high dimensions. As a result, high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other.
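A quick numeric check of this claim (the point counts and dimensions are arbitrary):

    import numpy as np

    rng = np.random.RandomState(42)
    for d in (2, 3, 100, 10_000):
        a, b = rng.rand(100, d), rng.rand(100, d)       # random points in the unit hypercube
        print(d, np.linalg.norm(a - b, axis=1).mean())  # average distance grows roughly like sqrt(d / 6)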
23%
In most real-world problems, training instances are not spread out uniformly across all dimensions.
23%
As a result, all training instances lie within (or close to) a much lower-dimensional subspace of the high-dimensional space. This sounds very abstract, so let’s look at an example. In Figure 8-2 you can see a 3D dataset represented by circles.
23%
Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called Manifold Learning.
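One Manifold Learning example available in Scikit-Learn, LLE on the classic Swiss roll toy dataset (the hyperparameter values are just illustrative):

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=41)
    lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)  # unroll onto a 2D manifold
    X_unrolled = lle.fit_transform(X)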
23%
It relies on the manifold assumption, also called the manifold hypothesis, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. This assumption is very often empirically observed.
23%
Once again, think about the MNIST dataset: all handwritten digit images have some similarities. They are made of connected lines, the borders are white, and they are more or less centered. If you randomly generated images, only a ridiculously tiny fraction of them would look like handwritten digits.
Note by Alexandre Gomes: the dimension of the space is width × height, with width and height given in pixels (for MNIST, 28 × 28 = 784 dimensions).
23%
The manifold assumption is often accompanied by another implicit assumption: that the task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold.
24%
Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space.
24%
As kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help you select the best kernel and hyperparameter values.
24%
That said, dimensionality reduction is often a preparation step for a supervised learning task (e.g., classification), so you can use grid search to select the kernel and hyperparameters that lead to the best performance on that task.
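A sketch of that selection strategy (the dataset, downstream classifier, and parameter ranges below are just examples):

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.decomposition import KernelPCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

    clf = Pipeline([
        ("kpca", KernelPCA(n_components=2)),
        ("log_reg", LogisticRegression()),
    ])
    param_grid = [{
        "kpca__gamma": np.linspace(0.03, 0.05, 10),
        "kpca__kernel": ["rbf", "sigmoid"],
    }]
    grid_search = GridSearchCV(clf, param_grid, cv=3)
    grid_search.fit(X, y)
    print(grid_search.best_params_)  # kernel/gamma giving the best classification accuracy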
24%
Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This point is called the reconstruction pre-image.
24%
Once you have this pre-image, you can measure its squared distance to the original instance. You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.
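A sketch of that technique (X is assumed to be the training set; fit_inverse_transform=True makes KernelPCA learn the mapping back to the original space, and gamma is just an example value):

    from sklearn.decomposition import KernelPCA
    from sklearn.metrics import mean_squared_error

    rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                        fit_inverse_transform=True)
    X_reduced = rbf_pca.fit_transform(X)
    X_preimage = rbf_pca.inverse_transform(X_reduced)   # pre-images back in the original space
    print(mean_squared_error(X, X_preimage))            # reconstruction pre-image error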
30%
Siegrid Löwel later summarized Hebb’s idea in the catchy phrase, “Cells that fire together, wire together”; that is, the connection weight between two neurons tends to increase when they fire simultaneously. This rule later became known as Hebb’s rule (or Hebbian learning).
30%
Perceptrons are trained using a variant of this rule that takes into account the error made by the network when it makes a prediction; the Perceptron learning rule reinforces connections that help reduce the error.
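For the weight connecting input i to output neuron j, that rule can be written as (a sketch in standard notation: η is the learning rate, x_i the input value, ŷ_j the prediction, and y_j the target output):

    w_{i,j}^{(\text{next step})} = w_{i,j} + \eta\,\bigl(y_j - \hat{y}_j\bigr)\,x_i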
30%
But in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper that introduced the backpropagation training algorithm, which is still used today.
30%
In short, it is Gradient Descent (introduced in Chapter 4) using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network’s error with regard to every single model parameter.
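The automatic gradient computation it relies on is reverse-mode autodiff; here is a tiny TensorFlow sketch of the idea (the function and values are arbitrary):

    import tensorflow as tf

    w1, w2 = tf.Variable(5.0), tf.Variable(3.0)
    with tf.GradientTape() as tape:       # records the forward pass
        z = 3 * w1 ** 2 + 2 * w1 * w2     # an arbitrary "loss"
    grads = tape.gradient(z, [w1, w2])    # one backward pass -> gradients w.r.t. both variables
    print(grads)                          # [36.0, 10.0]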
30%
So if you don’t have some nonlinearity between layers, then even a deep stack of layers is equivalent to a single layer, and you can’t solve very complex problems with that.
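A tiny NumPy check of this (random weight matrices, biases omitted for brevity):

    import numpy as np

    rng = np.random.RandomState(0)
    W1, W2, x = rng.randn(4, 3), rng.randn(2, 4), rng.randn(3)
    # Two stacked linear layers collapse into a single linear layer:
    print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))  # True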
30%
Conversely, a large enough DNN with nonlinear activations can theoretically approximate any continuous function.
35%
For example, if you have already trained a model to recognize faces in pictures and you now want to train a new neural network to recognize hairstyles, you can kickstart the training by reusing the lower layers of the first network.
35%
This is called transfer learning.
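A hedged Keras sketch of that reuse (the file name, number of hairstyle classes, and layer choices are hypothetical; typically you also freeze the reused layers for the first few epochs):

    from tensorflow import keras

    n_hairstyles = 10                                              # hypothetical number of classes
    face_model = keras.models.load_model("my_face_model.h5")       # hypothetical pretrained model
    hair_model = keras.models.Sequential(face_model.layers[:-1])   # reuse all but the output layer
    hair_model.add(keras.layers.Dense(n_hairstyles, activation="softmax"))

    for layer in hair_model.layers[:-1]:
        layer.trainable = False   # freeze the reused layers at first
    hair_model.compile(loss="sparse_categorical_crossentropy",
                       optimizer="sgd", metrics=["accuracy"])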
38%
If you do not have much labeled training data, one last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task.
38%
Gathering hundreds of pictures of each person would not be practical.
38%
You could, however, gather a lot of pictures of random people on the web and train a first neural network to detect whether or not two different pictures feature the same person.
38%
Such a network would learn good feature detectors for faces, so reusing its lower layers would allow you to train a good face classifier using little training data.
38%
For natural language processing (NLP) applications, you can download a corpus of millions of text documents and automatically generate labeled data from it.
38%
For example, you could randomly mask out some words and train a model to predict what the missing words are (e.g., it should predict that the missing word in the sentence “What ___ you saying?” is probably “are” or “were”).
38%
If you can train a model to reach good performance on this task, then it will already know quite a lot about language, and you can certainly reuse it for your actual task and fine-tune it on your labeled data.
41%
Google’s TensorFlow Hub provides a way to easily download and reuse pretrained neural networks.
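A hedged sketch of the usual pattern (the module URL is just an example sentence-embedding module; hub.KerasLayer downloads the module and wraps it as a regular Keras layer):

    import tensorflow as tf
    import tensorflow_hub as hub
    from tensorflow import keras

    hub_layer = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim50/2",
                               dtype=tf.string, input_shape=[], trainable=False)
    model = keras.Sequential([
        hub_layer,                                   # maps each sentence to a 50D embedding
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])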
49%
Indeed, the better the representation, the easier it will be for the neural network to make accurate predictions, so training tends to make embeddings useful representations of the categories. This is called representation learning (we will see other types of representation learning later in this book).
49%
Let’s look at how we could implement embeddings manually, to understand how they work (then we will use a simple Keras layer instead).
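A sketch of the manual version (the vocabulary size, embedding dimension, and category indices are made up): an embedding matrix is just a trainable variable of shape [vocabulary size, embedding dimension], and looking up a category means taking its row.

    import tensorflow as tf

    vocab_size, embedding_dim = 5, 2
    embedding_matrix = tf.Variable(tf.random.uniform([vocab_size, embedding_dim]))  # trainable

    cat_indices = tf.constant([3, 0, 3])   # some category IDs to look up
    embeddings = tf.nn.embedding_lookup(embedding_matrix, cat_indices)
    print(embeddings)  # one row per category; the rows get tuned during training

In practice, keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim) creates the variable and performs the same lookup for you.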