Kindle Notes & Highlights
A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information.
Yet another important unsupervised task is anomaly detection — for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm.
A very similar task is novelty detection: it aims to detect new instances that look different from all instances in the training set.
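To make the task concrete, here is a minimal anomaly detection sketch using Scikit-Learn's IsolationForest on invented synthetic data (this is not the book's example, just an illustration of the idea):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Invented synthetic data: mostly "normal" points plus a few outliers
    rng = np.random.RandomState(42)
    X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
    X_outliers = rng.uniform(low=-6, high=6, size=(5, 2))
    X = np.vstack([X_normal, X_outliers])

    # IsolationForest flags instances that are easy to isolate as anomalies
    iso_forest = IsolationForest(contamination=0.03, random_state=42)
    labels = iso_forest.fit_predict(X)  # +1 = inlier, -1 = anomaly
    print("Detected anomalies:", (labels == -1).sum())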
This is called instance-based learning: the system learns the examples by heart, then generalizes to new cases by using a similarity measure to compare them to the learned examples (or a subset of them). For example, in Figure 1-15 the new instance would be classified as a triangle because the majority of the most similar instances belong to that class.
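A k-nearest-neighbors classifier is the textbook example of instance-based learning; the sketch below uses invented toy points rather than the figure's data:

    from sklearn.neighbors import KNeighborsClassifier

    # Invented 2-D points: class 0 = "square", class 1 = "triangle"
    X_train = [[1, 1], [1, 2], [2, 1],   # class 0
               [6, 5], [7, 6], [6, 6]]   # class 1
    y_train = [0, 0, 0, 1, 1, 1]

    # The model simply stores the training instances ("learns them by heart")
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)

    # A new instance gets the majority class among its 3 most similar neighbors
    print(knn.predict([[6.0, 5.5]]))  # -> [1]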
One of the most important transformations you need to apply to your data is feature scaling.
With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales.
This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Note that scaling...
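As a rough sketch of the two common options (min-max scaling and standardization), here is how Scikit-Learn's scalers could be applied; the small matrix below is invented to mimic the mismatched scales of total_rooms and median_income:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # Invented values mimicking total_rooms (large) vs. median_income (small)
    X = np.array([[ 880.0, 8.3252],
                  [7099.0, 8.3014],
                  [1467.0, 7.2574],
                  [1274.0, 5.6431]])

    # Min-max scaling squashes each feature into the 0-1 range
    X_min_max = MinMaxScaler().fit_transform(X)

    # Standardization gives each feature zero mean and unit variance
    X_std = StandardScaler().fit_transform(X)

    print(X_min_max.round(3))
    print(X_std.round(3))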
This is better than nothing, but clearly not a great score: most districts’ median_housing_values range between $120,000 and $265,000, so a typical prediction error of $68,628 is not very satisfying.
This is an example of a model underfitting the training data. When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough.
As we saw in the previous chapter, the main ways to fix underfitting are to select a more powerful model, to feed the training algorithm with better features, or to reduce the constraints on the model.
Wow! Above 93% accuracy (ratio of correct predictions) on all cross-validation folds? This looks amazing, doesn’t it? Well, before you get too excited, let’s look at a very dumb classifier that just classifies every single image in the “not-5” class:
That’s right, it has over 90% accuracy! This is simply because only about 10% of the images are 5s, so if you always guess that an image is not a 5, you will be right about 90% of the time. Beats Nostradamus.
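A minimal sketch of such a "never 5" classifier (the book implements something very similar), assuming the chapter's MNIST variables X_train and y_train_5:

    import numpy as np
    from sklearn.base import BaseEstimator

    class Never5Classifier(BaseEstimator):
        """Dumb classifier that predicts "not 5" for every image."""
        def fit(self, X, y=None):
            return self
        def predict(self, X):
            return np.zeros((len(X),), dtype=bool)

    # Assuming X_train and y_train_5 (True for 5s) exist as in the MNIST
    # example, cross-validated accuracy still comes out around 90%:
    # from sklearn.model_selection import cross_val_score
    # cross_val_score(Never5Classifier(), X_train, y_train_5, cv=3,
    #                 scoring="accuracy")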
This demonstrates why accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (i.e., when some classes are much more frequent than others).
A much better way to evaluate the performance of a classifier is to look at the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B.
The confusion matrix gives you a lot of information, but sometimes you may prefer a more concise metric. An interesting one to look at is the accuracy of the positive predictions; this is called the precision of the classifier (Equation 3-1).
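A rough sketch of both steps, assuming the chapter's sgd_clf, X_train and y_train_5 variables:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    # Assumes sgd_clf, X_train and y_train_5 exist as in the MNIST example
    y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

    # Rows are actual classes, columns are predicted classes
    print(confusion_matrix(y_train_5, y_train_pred))

    # Precision = TP / (TP + FP), Recall = TP / (TP + FN)
    print(precision_score(y_train_5, y_train_pred))
    print(recall_score(y_train_5, y_train_pred))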
For example, if you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision), rather than a classifier that has a much higher recall but lets a few really bad videos show up in your product (in such cases, you may even want to add a human pipeline to check the classifier’s video selection).
On the other hand, suppose you train a classifier to detect shoplifters in surveillance images: it is probably fine if your classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a few false alerts, but almost all shoplifters will get caught).
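One common way to pick the trade-off is to compute decision scores with cross_val_predict and choose a threshold on the precision/recall curve; a sketch, again assuming the chapter's sgd_clf, X_train and y_train_5:

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import precision_recall_curve

    # Assumes sgd_clf, X_train and y_train_5 exist as in the MNIST example
    y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                                 method="decision_function")
    precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

    # Pick the lowest threshold giving at least 90% precision, then classify
    # by comparing scores to it instead of calling predict()
    threshold_90 = thresholds[np.argmax(precisions >= 0.90)]
    y_pred_90 = (y_scores >= threshold_90)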
You now know how to train binary classifiers, choose the appropriate metric for your task, evaluate your classifiers using cross-validation, select the precision/recall trade-off that fits your needs, and use ROC curves and ROC AUC scores to compare various models. Now let’s try to detect more than just the 5s.
One way to create a system that can classify the digit images into 10 classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). Then when you want to classify an image, you get the decision score from each classifier for that image and you select the class whose classifier outputs the highest score. This is called the one-versus-the-rest (OvR) strategy (also called one-versus-all).
Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. This is called the one-versus-one (OvO) strategy.
Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvR or OvO, depending on the algorithm. Let’s try this with a Support Vector Machine classifier (see Chapter 5), using the sklearn.svm.SVC class:
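The highlight cuts off before the code; a sketch of what that call might look like, assuming the chapter's X_train and y_train (full 0-9 labels) and training on a small subset to keep it fast:

    from sklearn.svm import SVC

    # Assumes X_train and y_train exist as in the MNIST example
    svm_clf = SVC()
    svm_clf.fit(X_train[:1000], y_train[:1000])  # subset for speed

    some_digit = X_train[0]
    print(svm_clf.predict([some_digit]))  # Scikit-Learn ran OvO under the hood

    # decision_function returns one score per class (10 scores here)
    print(svm_clf.decision_function([some_digit]).shape)  # (1, 10)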
Here, we will assume that you have found a promising model and you want to find ways to improve it. One way to do this is to analyze the types of errors it makes.
First, look at the confusion matrix.
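A sketch of that first step, assuming the chapter's sgd_clf, a scaled copy of the training set (X_train_scaled) and y_train:

    import matplotlib.pyplot as plt
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    # Assumes sgd_clf, X_train_scaled and y_train exist as in the MNIST example
    y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
    conf_mx = confusion_matrix(y_train, y_train_pred)

    # An image plot makes the error pattern easier to spot than raw counts
    plt.matshow(conf_mx, cmap=plt.cm.gray)
    plt.show()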
However, most misclassified images seem like obvious errors to us, and it’s hard to understand why the classifier made the mistakes it did.
The reason is that we used a simple SGDClassifier, which is a linear model. All it does is assign a weight per class to each pixel, and when it sees a new image it just sums up the weighted pixel intensities to get a score for each class.
So since 3s and 5s differ only by a few pixels, this model will easily confuse them.
In other words, this classifier is quite sensitive to image shifting and rotation.
So one way to reduce the 3/5 confusion would be to preprocess the images to ensure that they are well centered and not too rotated.
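One simple way to do that kind of preprocessing (or to augment the training set with shifted copies of the images) is scipy.ndimage.shift; a sketch assuming flattened 28x28 MNIST images:

    import numpy as np
    from scipy.ndimage import shift

    def shift_image(image, dx, dy):
        """Shift a flattened 28x28 MNIST image by (dx, dy) pixels."""
        shifted = shift(image.reshape((28, 28)), [dy, dx], cval=0,
                        mode="constant")
        return shifted.reshape([-1])

    # Example: nudge an off-center digit one pixel to the right
    # (assumes some_digit is a 784-element NumPy array)
    # recentered = shift_image(some_digit, dx=1, dy=0)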
Looks close enough to the target! This concludes our tour of classification. You should now know how to select good metrics for classification tasks, pick the appropriate precision/recall trade-off, compare classifiers, and more generally build good classification systems for a variety of tasks.
Now we will look at a very different way to train a Linear Regression model, which is better suited for cases where there are a large number of features or too many training instances to fit in memory.
Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve.
This implies that there are no local minima, just one global minimum. It is also a continuous function with a slope that never changes abruptly.
These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).
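A minimal Batch Gradient Descent sketch on the MSE cost, using invented synthetic data (y = 4 + 3x plus noise); the gradient formula in the comment is the standard one for Linear Regression:

    import numpy as np

    # Invented synthetic data: y = 4 + 3x + Gaussian noise
    np.random.seed(42)
    m = 100
    X = 2 * np.random.rand(m, 1)
    y = 4 + 3 * X + np.random.randn(m, 1)
    X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance

    eta = 0.1                      # learning rate
    n_iterations = 1000
    theta = np.random.randn(2, 1)  # random initialization

    for iteration in range(n_iterations):
        # Gradient of the MSE cost: (2/m) * X_b.T @ (X_b @ theta - y)
        gradients = 2 / m * X_b.T @ (X_b @ theta - y)
        theta = theta - eta * gradients

    print(theta)  # ends up close to [[4.], [3.]]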
The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large.
At the opposite extreme, Stochastic Gradient Descent picks a random instance in the training set at every step and computes the gradients based only on that single instance.
When using Stochastic Gradient Descent, the training instances must be independent and identically distributed (IID) to ensure that the parameters get pulled toward the global optimum, on average.
A simple way to ensure this is to shuffle the instances during training (e.g., pick each instance randomly, or shuffle the training set at the beginning of each epoch).
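In practice, Scikit-Learn's SGDRegressor already shuffles the training set at each epoch by default (shuffle=True); a sketch on invented synthetic data:

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    # Invented synthetic data: y = 4 + 3x + Gaussian noise
    np.random.seed(42)
    X = 2 * np.random.rand(100, 1)
    y = (4 + 3 * X + np.random.randn(100, 1)).ravel()

    # shuffle=True (the default) reshuffles the instances every epoch,
    # which keeps the gradient steps roughly IID
    sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None,
                           eta0=0.1, random_state=42, shuffle=True)
    sgd_reg.fit(X, y)
    print(sgd_reg.intercept_, sgd_reg.coef_)  # close to [4.] and [3.]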
But in general you won’t know what function generated the data, so how can you decide how complex your model should be?
How can you tell that your model is overfitting or underfitting the data?
If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then your model is overfitting.
If it performs poorly on both, then it is underfitting.
This underfitting model deserves a bit of explanation.
This means that the model performs significantly better on the training data than on the validation data, which is the hallmark of an overfitting model.
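A sketch of how to check this with learning curves, comparing training and cross-validation error as the training set grows (the noisy quadratic data below is invented; a plain LinearRegression on it should underfit):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import learning_curve

    # Invented noisy quadratic data; a straight line will underfit it
    np.random.seed(42)
    X = 6 * np.random.rand(100, 1) - 3
    y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(100)

    train_sizes, train_scores, valid_scores = learning_curve(
        LinearRegression(), X, y, cv=5,
        scoring="neg_root_mean_squared_error",
        train_sizes=np.linspace(0.1, 1.0, 10))

    train_rmse = -train_scores.mean(axis=1)
    valid_rmse = -valid_scores.mean(axis=1)
    # Both errors high and close together -> underfitting;
    # low training error with a large gap to validation error -> overfitting
    print(train_rmse[-1], valid_rmse[-1])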
For a linear model, regularization is typically achieved by constraining the weights of the model.
We will now look at Ridge Regression, Lasso Regression, and Elastic Net, which implement three different ways to constrain the weights.
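A quick sketch fitting all three with Scikit-Learn on invented linear data (alpha controls the regularization strength, l1_ratio the Elastic Net mix):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    # Invented linear data: y = 1 + 0.5x + noise
    np.random.seed(42)
    X = 3 * np.random.rand(100, 1)
    y = 1 + 0.5 * X[:, 0] + np.random.randn(100) / 2

    # Ridge penalizes the l2 norm of the weights, Lasso the l1 norm,
    # and Elastic Net mixes both
    ridge_reg = Ridge(alpha=1.0).fit(X, y)
    lasso_reg = Lasso(alpha=0.1).fit(X, y)
    elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

    for model in (ridge_reg, lasso_reg, elastic_net):
        print(model.__class__.__name__, model.intercept_, model.coef_)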