Kindle Notes & Highlights
Machine Learning is the science (and art) of programming computers so they can learn from data.
Another area where Machine Learning shines is for problems that either are too complex for traditional approaches or have no known algorithm. For example, consider speech recognition.
Finally, Machine Learning can help humans learn (Figure 1-4): ML algorithms can be inspected to see what they have learned
To summarize, Machine Learning is great for:
- Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
- Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
- Fluctuating environments: a Machine Learning system can adapt to new data.
- Getting insights about complex problems and large amounts of data.
Machine Learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised learning, unsupervised learning, semisupervised learning, and Reinforcement Learning.
In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels
A typical supervised learning task is classification.
Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression
In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value
In unsupervised learning, as you might guess, the training data is unlabeled (Figure 1-7). The system tries to learn without a teacher.
If you use a hierarchical clustering algorithm, it may also subdivide each group into smaller groups. This may help you target your posts for each group.
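A minimal sketch of the idea, with some assumptions: Scikit-Learn's AgglomerativeClustering is used as the hierarchical algorithm (the text does not name one), and the tiny two-feature dataset is invented for illustration.

```python
# Hierarchical (agglomerative) clustering sketch; data is made up
# (e.g., two features per user such as age and hours on site).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[25, 1.0], [27, 1.2], [45, 8.0],
              [47, 7.5], [31, 2.0], [52, 9.1]])

# Ask for two top-level groups; re-running with a larger n_clusters
# cuts the same hierarchy lower down, subdividing the groups.
clustering = AgglomerativeClustering(n_clusters=2).fit(X)
print(clustering.labels_)   # cluster index assigned to each instance
```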
Visualization algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted
A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information.
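A minimal sketch of both ideas, assuming PCA as the reduction algorithm and the iris dataset as an example (neither is specified in the text; t-SNE is another common choice for visualization).

```python
# Reduce a 4-dimensional dataset to 2D so it can be scatter-plotted.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                        # 4 features per instance
X_2d = PCA(n_components=2).fit_transform(X) # project onto 2 dimensions
print(X_2d.shape)                           # (150, 2), ready to plot
```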
Yet another important unsupervised task is anomaly detection — for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm. The system is trained with normal instances, and when it sees a new instance it can tell whether it looks like a normal one or whether it is likely an anomaly
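One possible sketch of this, assuming IsolationForest as the detector (the text does not name an algorithm) and synthetic "normal" training data:

```python
# Anomaly-detection sketch: train on normal instances, then flag
# new instances that do not look like them.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # "normal" data

detector = IsolationForest(random_state=42).fit(X_normal)
# predict() returns 1 for normal-looking instances, -1 for anomalies.
print(detector.predict([[0.1, -0.2], [8.0, 9.0]]))
```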
Finally, another common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes.
Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms.
Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as in Figure 1-12).
Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data.
In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline.
In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives
Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning).
One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate.
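A minimal sketch of incremental training, under assumptions: SGDRegressor with a constant learning rate (eta0) is one illustrative choice, and the streamed mini-batches are synthetic.

```python
# Online learning sketch: feed mini-batches sequentially via partial_fit().
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(42)
# learning_rate/eta0 control how fast the model adapts to new data.
model = SGDRegressor(learning_rate="constant", eta0=0.01)

for _ in range(100):                       # stream of incoming mini-batches
    X_batch = rng.rand(32, 3)              # 32 instances, 3 features each
    y_batch = X_batch @ np.array([2.0, -1.0, 0.5]) + rng.randn(32) * 0.1
    model.partial_fit(X_batch, y_batch)    # one fast, cheap learning step

print(model.coef_)                         # should approach [2, -1, 0.5]
```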
One more way to categorize Machine Learning systems is by how they generalize.
There are two main approaches to generalization: instance-based learning and model-based learning.
This is called instance-based learning: the system learns the examples by heart, then generalizes to new cases using a similarity measure
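A minimal instance-based sketch, assuming k-Nearest Neighbors as the similarity-based learner (a standard example of this family) and made-up data:

```python
# k-NN memorizes the training examples and classifies new cases
# by similarity (here, distance in feature space).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[1.1], [5.1]]))   # -> [0 1]
```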
Another way to generalize from a set of examples is to build a model of these examples, then use that model to make predictions. This is called model-based learning
For linear regression problems, people typically use a cost function that measures the distance between the linear model’s predictions and the training examples; the objective is to minimize this distance. This is where the Linear Regression algorithm comes in: you feed it your training examples and it finds the parameters that make the linear model fit your data best. This is called training the model.
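A minimal model-based sketch; the synthetic data is an assumption, and Scikit-Learn's LinearRegression is used as the fitting algorithm:

```python
# Fit a linear model to training examples, then use it to predict.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # single feature
y = np.array([3.1, 4.9, 7.2, 8.8])           # roughly y = 2x + 1

model = LinearRegression().fit(X, y)         # "training the model"
print(model.coef_, model.intercept_)         # the learned parameters
print(model.predict([[5.0]]))                # prediction for a new instance
```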
Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well.
The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm (not of the model).
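One concrete illustration, assuming Ridge regression (not named in the text) whose alpha hyperparameter sets the regularization strength:

```python
# alpha is set before training and is not learned from the data;
# larger alpha constrains (shrinks) the model's weights more.
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])

for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)        # weights shrink as alpha grows
```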
Underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data.
To avoid “wasting” too much training data in validation sets, a common technique is to use cross-validation: the training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts. Once the model type and hyperparameters have been selected, a final model is trained using these hyperparameters on the full training set, and the generalization error is measured on the test set.
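A minimal cross-validation sketch; the dataset and classifier here are illustrative assumptions:

```python
# cross_val_score splits the training set into K folds, trains on
# K-1 of them and validates on the remaining one, K times.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
print(scores, scores.mean())
```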
Estimators. Any object that can estimate some parameters based on a dataset is called an estimator
The estimation itself is performed by the fit() method, and it takes only a dataset as a parameter (or two for supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the estimation process is considered a hyperparameter (such as an imputer’s strategy), and it must be set as an instance variable
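A concrete sketch of this API, with one assumption: SimpleImputer is the current Scikit-Learn name for the imputer the text refers to, and the data is made up.

```python
# An estimator: fit() learns one statistic per column, guided by the
# strategy hyperparameter set as an instance variable at construction.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [7.0, 6.0]])

imputer = SimpleImputer(strategy="median")   # hyperparameter
imputer.fit(X)                               # estimation from the dataset alone
print(imputer.statistics_)                   # learned parameters: [4. 4.]
```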
Transformers. Some estimators (such as an imputer) can also transform a dataset; these are called transformers. The API is quite simple: the transformation is performed by the transform() method with the dataset to transform as a parameter.
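Continuing the imputer sketch above (again assuming the modern SimpleImputer class):

```python
# A transformer: transform() applies the parameters learned by fit();
# fit_transform() is the convenience method that does both in one call.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [7.0, 6.0]])
imputer = SimpleImputer(strategy="median").fit(X)

print(imputer.transform(X))                          # NaN -> learned median
print(SimpleImputer(strategy="median").fit_transform(X))  # same result
```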
Predictors. Finally, some estimators are capable of making predictions given a dataset; they are called predictors.
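A minimal predictor sketch; LinearRegression is used as an illustrative choice and the data is synthetic:

```python
# A predictor: predict() makes predictions for new instances,
# and score() evaluates the quality of those predictions.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

predictor = LinearRegression().fit(X, y)
print(predictor.predict([[4.0]]))    # -> [8.]
print(predictor.score(X, y))         # R^2 score on the given dataset
```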
The most common supervised learning tasks are regression (predicting values) and classification (predicting classes).
To compute the confusion matrix, you first need to have a set of predictions, so they can be compared to the actual targets.
cross_val_predict() performs K-fold cross-validation, but instead of returning the evaluation scores, it returns the predictions made on each test fold.
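A minimal sketch wiring these two steps together; the dataset and classifier are illustrative assumptions:

```python
# Get out-of-sample predictions with cross_val_predict(), then
# compare them to the actual targets in a confusion matrix.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each instance is predicted by a model that never saw it in training.
y_pred = cross_val_predict(model, X, y, cv=3)
print(confusion_matrix(y, y_pred))   # rows = actual, columns = predicted
```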
Each row in a confusion matrix represents an actual class, while each column represents a predicted class.
A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right):
The confusion matrix gives you a lot of information, but sometimes you may prefer a more concise metric. An interesting one to look at is the accuracy of the positive predictions; this is called the precision of the classifier (Equation 3-1).
TP is the number of true positives, and FP is the number of false positives.
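For reference, this is the standard formula the text calls Equation 3-1:

```latex
\text{precision} = \frac{TP}{TP + FP}
```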
Precision is typically used along with another metric named recall, also called sensitivity or true positive rate (TPR): this is the ratio of positive instances that are correctly detected by the classifier (Equation 3-2).
FN is of course the number of false negatives.
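And Equation 3-2, using the same notation:

```latex
\text{recall} = \frac{TP}{TP + FN}
```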
To understand this tradeoff, let’s look at how the SGDClassifier makes its classification decisions. For each instance, it computes a score based on a decision function, and if that score is greater than a threshold, it assigns the instance to the positive class, or else it assigns it to the negative class.
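A minimal sketch of this mechanism, with assumptions: the iris dataset recast as a binary task is invented for illustration, and the threshold of 0 matches SGDClassifier's default behavior.

```python
# Score each instance with decision_function(), then threshold manually.
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)
y_binary = (y == 0)                          # binary "is it class 0?" target

clf = SGDClassifier(random_state=42).fit(X, y_binary)
scores = clf.decision_function(X[:3])        # one score per instance
threshold = 0.0                              # the default decision threshold
print(scores > threshold)                    # manual thresholding
print(clf.predict(X[:3]))                    # matches the default threshold
```

Raising the threshold above 0 makes the classifier more conservative, which increases precision at the cost of recall; lowering it does the reverse.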