Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information.
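As a hedged illustration of that idea, here is a minimal sketch using PCA (one common dimensionality-reduction technique) on scikit-learn's built-in iris data; none of these names come from the book's own code.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 4 numerical features per flower
pca = PCA(n_components=2)              # keep only 2 components...
X_reduced = pca.fit_transform(X)       # ...chosen to preserve as much variance as possible
print(X.shape, "->", X_reduced.shape)  # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)   # how much information (variance) each component retains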
Yet another important unsupervised task is anomaly detection — for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm.
A very similar task is novelty detection: it aims to detect new instances that look different from all instances in the training set.
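A rough sketch of both tasks on toy data, assuming scikit-learn's IsolationForest for anomaly detection and a One-Class SVM fit only on "normal" instances for novelty detection; the data and names here are made up for illustration.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

np.random.seed(42)
X_train = np.random.randn(200, 2)              # mostly "normal" instances
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])    # one ordinary point, one obvious outlier

iso_forest = IsolationForest(random_state=42).fit(X_train)
print(iso_forest.predict(X_new))               # +1 = inlier, -1 = anomaly

one_class_svm = OneClassSVM(gamma="scale").fit(X_train)   # trained on normal data only
print(one_class_svm.predict(X_new))            # -1 flags instances unlike anything seen in training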
This is called instance-based learning: the system learns the examples by heart, then generalizes to new cases by using a similarity measure to compare them to the learned examples (or a subset of them). For example, in Figure 1-15 the new instance would be classified as a triangle because the majority of the most similar instances belong to that class.
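A minimal sketch of this idea with k-Nearest Neighbors, scikit-learn's canonical instance-based learner; the synthetic data and variable names are assumptions for illustration, not the book's Figure 1-15 example.

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X_train, y_train = make_classification(n_samples=200, n_features=4, random_state=42)

knn_clf = KNeighborsClassifier(n_neighbors=5)  # similarity measure: distance to the 5 nearest examples
knn_clf.fit(X_train, y_train)                  # "learns the examples by heart" (it just stores them)
print(knn_clf.predict(X_train[:3]))            # new cases get the majority class of their neighbors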
One of the most important transformations you need to apply to your data is feature scaling.
With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales.
This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Note that scaling...
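A small sketch of the two standard options, min-max scaling and standardization; the tiny array below (columns standing in for total_rooms and median_income) is made up to mirror the ranges mentioned above.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# columns: [total_rooms, median_income] -- wildly different scales
housing_num = np.array([[880.0, 1.5],
                        [39320.0, 15.0],
                        [2127.0, 3.2]])

min_max_scaler = MinMaxScaler()                 # rescales each attribute to the 0-1 range
print(min_max_scaler.fit_transform(housing_num))

std_scaler = StandardScaler()                   # zero mean, unit variance per attribute
print(std_scaler.fit_transform(housing_num))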
This is better than nothing, but clearly not a great score: most districts’ median_housing_values range between $120,000 and $265,000, so a typical prediction error of $68,628 is not very satisfying.
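For context, a typical prediction error like this is usually reported as the RMSE; here is a hedged sketch of the computation on placeholder predictions and labels, not the book's actual arrays.

import numpy as np
from sklearn.metrics import mean_squared_error

housing_labels = np.array([156100.0, 243500.0, 194200.0])       # true median house values
housing_predictions = np.array([210000.0, 180000.0, 250000.0])  # a model's predictions

lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)    # root mean squared error, in the same units (dollars)
print(lin_rmse)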
This is an example of a model underfitting the training data.
As we saw in the previous chapter, the main ways to fix underfitting are to select a more powerful model, to feed the training algorithm with better features, or to reduce the constraints on the model.
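As one hedged illustration of "select a more powerful model", the sketch below swaps a plain linear model for a Random Forest on a nonlinear toy dataset; the data and scores are placeholders, not the housing results.

from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

X, y = make_friedman1(n_samples=500, noise=0.5, random_state=42)  # nonlinear target

lin_reg = LinearRegression().fit(X, y)
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
print(lin_reg.score(X, y))      # the linear model underfits the nonlinear signal
print(forest_reg.score(X, y))   # training score only -- still needs cross-validation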
Wow! Above 93% accuracy (ratio of correct predictions) on all cross-validation folds? This looks amazing, doesn’t it? Well, before you get too excited, let’s look at a very dumb classifier that just classifies every single image in the “not-5” class:
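A sketch of such a baseline, along the lines of the book's example; the MNIST loading step is an assumption about the setup, and the exact accuracy figures will vary.

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.datasets import fetch_openml
from sklearn.model_selection import cross_val_score

class Never5Classifier(BaseEstimator):
    """A very dumb classifier: every image is predicted to be 'not-5'."""
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros(len(X), dtype=bool)

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X_train, y_train = mnist.data[:60000], mnist.target[:60000]
y_train_5 = (y_train == '5')     # True for 5s, False for every other digit

print(cross_val_score(Never5Classifier(), X_train, y_train_5, cv=3, scoring="accuracy"))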
That’s right, it has over 90% accuracy! This is simply because only about 10% of the images are 5s, so if you always guess that an image is not a 5, you will be right about 90% of the time. Beats Nostradamus.
This demonstrates why accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (i.e., when some classes are much more frequent than others).
A much better way to evaluate the performance of a classifier is to look at the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B.
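A minimal sketch of the mechanics, using out-of-fold predictions from cross_val_predict so the matrix reflects "clean" predictions; the skewed synthetic dataset below stands in for the 5/not-5 task.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

X_train, y_train_5 = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

sgd_clf = SGDClassifier(random_state=42)
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)  # predictions made on held-out folds
print(confusion_matrix(y_train_5, y_train_pred))  # rows = actual classes, columns = predicted classes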
The confusion matrix gives you a lot of information, but sometimes you may prefer a more concise metric. An interesting one to look at is the accuracy of the positive predictions; this is called the precision of the classifier (Equation 3-1).
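The equation referenced above boils down to precision = TP / (TP + FP); here is a tiny hedged sketch computing it both by hand and with scikit-learn on made-up labels.

import numpy as np
from sklearn.metrics import precision_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # actual classes
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0])   # a classifier's positive/negative predictions

tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))    # false positives
print(tp / (tp + fp))                         # 3 / (3 + 1) = 0.75
print(precision_score(y_true, y_pred))        # same result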
For example, if you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision), rather than a classifier that has a much higher recall but lets a few really bad videos show up in your product (in such cases, you may even want to add a human pipeline to check the classifier’s video selection).
On the other hand, suppose you train a classifier to detect shoplifters in surveillance images: it is probably fine if your classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a f...
You now know how to train binary classifiers, choose the appropriate metric for your task, evaluate your classifiers using cross-validation, select the precision/recall trade-off that fits your needs, and use ROC curves and ROC AUC scores to compare various models. Now let’s try to detect more than just the 5s.
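As a hedged sketch of the "compare various models" part, the snippet below scores two classifiers by ROC AUC using cross-validated decision scores; the synthetic skewed dataset and model choices are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

sgd_scores = cross_val_predict(SGDClassifier(random_state=42), X, y, cv=3,
                               method="decision_function")
forest_probas = cross_val_predict(RandomForestClassifier(random_state=42), X, y, cv=3,
                                  method="predict_proba")

print(roc_auc_score(y, sgd_scores))           # area under the ROC curve (1.0 = perfect)
print(roc_auc_score(y, forest_probas[:, 1]))  # use the estimated probability of the positive class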
One way to create a system that can classify the digit images into 10 classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). Then when you want to classify an image, you get the decision score from each classifier for that image and you select the class whose classifier outputs the highest score. This is called the one-versus-the-rest (OvR) strategy (also called one-versus-all).
Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. This is called the one-versus-one (OvO) strategy.
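Both strategies can also be forced explicitly; here is a minimal sketch on scikit-learn's small built-in digits dataset (standing in for MNIST, which is an assumption on my part).

from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

ovr_clf = OneVsRestClassifier(SVC()).fit(X, y)   # 10 binary classifiers, one per digit
ovo_clf = OneVsOneClassifier(SVC()).fit(X, y)    # one binary classifier per pair of digits
print(len(ovr_clf.estimators_))                  # 10
print(len(ovo_clf.estimators_))                  # 45 = 10 * 9 / 2 pairs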
Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvR or OvO, depending on the algorithm. Let’s try this with a Support Vector Machine classifier (see Chapter 5), using the sklearn.svm.SVC class:
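A hedged version of that experiment, again on the small built-in digits set rather than MNIST:

from sklearn.datasets import load_digits
from sklearn.svm import SVC

X_train, y_train = load_digits(return_X_y=True)

svm_clf = SVC()                # a binary classification algorithm under the hood
svm_clf.fit(X_train, y_train)  # yet it trains directly on all 10 digit classes
print(svm_clf.predict(X_train[:5]))
print(svm_clf.decision_function(X_train[:1]).shape)  # one score per class for each instance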
Here, we will assume that you have found a promising model and you want to find ways to improve it. One way to do this is to analyze the types of errors it makes.
First, look at the confusion matrix.
However, most misclassified images seem like obvious errors to us, and it’s hard to understand why the classifier made the mistakes it did.
The reason is that we used a simple SGDClassifier, which is a linear model. All it does is assign a weight per class to each pixel, and when it sees a new image it just sums up the weight...
So since 3s and 5s differ only by a few pixels, this model will ...
In other words, this classifier is quite sensitive to image shifting and rotation.
So one way to reduce the 3/5 confusion would be to preprocess the images to ensure that they are well centered and not too rotated.
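One hedged way to implement the centering part (SciPy is assumed available; the 28x28 size matches MNIST): shift each image so that its center of mass lands on the image center.

import numpy as np
from scipy.ndimage import center_of_mass, shift

def center_digit(image):
    """Translate a 28x28 digit image so its center of mass sits at the image center."""
    cy, cx = center_of_mass(image)
    return shift(image, [13.5 - cy, 13.5 - cx], cval=0.0)

image = np.zeros((28, 28))
image[2:10, 2:10] = 1.0              # a digit-like blob stuck in the top-left corner
centered = center_digit(image)
print(center_of_mass(centered))      # now roughly (13.5, 13.5)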
Looks close enough to the target! This concludes our tour of classification. You should now know how to select good metrics for classification tasks, pick the appropriate precision/recall trade-off, compare classifiers, and more generally build good classification systems for a variety of tasks.
Now we will look at a very different way to train a Linear Regression model, which is better suited for cases where there are a large number of features or too many training instances to fit in memory.
Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve.
This implies that there are no local minima, just one global minimum. It is also a continuous function with a slope that never changes abruptly.
These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close the global minimum (if you wait long enough a...
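For reference, the cost function in question is the usual mean squared error of the linear model's predictions over the m training instances (standard definition, written out here rather than quoted from the book):

\[
\mathrm{MSE}(\boldsymbol{\theta}) \;=\; \frac{1}{m} \sum_{i=1}^{m}
  \left( \boldsymbol{\theta}^{\mathsf{T}} \mathbf{x}^{(i)} - y^{(i)} \right)^{2}
\]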
The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large.
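A compact sketch of Batch Gradient Descent for Linear Regression on generated data; note how every single step touches all m training instances when computing the gradients (the learning rate and data here are illustrative assumptions).

import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]            # add x0 = 1 (bias term) to every instance

eta = 0.1                                  # learning rate
theta = np.random.randn(2, 1)              # random initialization
for iteration in range(1000):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # uses the whole training set each step
    theta = theta - eta * gradients
print(theta)                               # should end up close to [[4], [3]]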
At the opposite extreme, Stochastic Gradient Descent picks a random instance in the training set at every step and computes the gradients based only on that ...
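A hedged sketch of the stochastic variant via scikit-learn's SGDRegressor, which updates the parameters one (shuffled) instance at a time instead of using the full training set per step:

import numpy as np
from sklearn.linear_model import SGDRegressor

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100)

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.1, random_state=42)
sgd_reg.fit(X, y)                          # each parameter update is based on a single instance
print(sgd_reg.intercept_, sgd_reg.coef_)   # should land in the vicinity of 4 and 3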
When using Stochastic Gradient Descent, the training instances must be independent and identically distributed (IID) to ensure that the parameters get pulled toward the global optimum, on average.
A simple way to ensure this is to shuffle the instances during training (e.g., pick each instance randomly, or shuffle the training set at the beginning of each epoch).
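A tiny sketch of the second option (shuffling the training set at the start of each epoch), with placeholder arrays:

import numpy as np

X = np.arange(10).reshape(5, 2)                       # 5 toy instances, 2 features each
y = np.arange(5)

for epoch in range(3):
    shuffled_indices = np.random.permutation(len(X))  # fresh random order every epoch
    X_epoch, y_epoch = X[shuffled_indices], y[shuffled_indices]
    # ...run one pass of Stochastic Gradient Descent over X_epoch / y_epoch here...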
But in general you won’t know what function generated the data, so how can you decide how complex your model should be?
How can you tell that your model is overfitting or underfitting the data?
If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then your model is overfitting.
If it performs poorly on both, then it is underfitting.
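A hedged sketch of that diagnostic: compare training-set performance with cross-validation performance for models of different complexity (a straight line, a quadratic, and a very high-degree polynomial) on made-up quadratic data.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(100)

for degree in (1, 2, 30):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_score = model.fit(X, y).score(X, y)                  # performance on the training data
    val_score = cross_val_score(model, X, y, cv=5).mean()      # cross-validation performance
    print(degree, round(train_score, 3), round(val_score, 3))  # big gap => overfitting; both low => underfitting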
This model that’s underfitting deserves a bit of explanation.
This means that the model performs significantly better on the training data than on the validation data, which is the hallmark of an overfitting model.
For a linear model, regularization is typically achieved by constraining the weights of the model.
We will now look at Ridge Regression, Lasso Regression, and Elastic Net, which implement three different ways to constrain the weights.
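A quick hedged preview of the three models named above, fit on placeholder data where only two of five features actually matter:

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

np.random.seed(42)
X = np.random.rand(100, 5)
y = X @ np.array([1.5, 0.0, 0.0, 2.0, 0.0]) + 0.1 * np.random.randn(100)

ridge_reg = Ridge(alpha=1.0).fit(X, y)                        # L2 penalty: shrinks all weights
lasso_reg = Lasso(alpha=0.1).fit(X, y)                        # L1 penalty: can zero out weights entirely
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of both penalties
print(ridge_reg.coef_)
print(lasso_reg.coef_)
print(elastic_net.coef_)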