Kindle Notes & Highlights
When we predict integers or continuous values, such as the number of fruits purchased, we would be solving a regression problem (see Figure 1a). When we predict binary or categorical values, such as whether it would rain or not, we would be solving a classification problem.
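A minimal sketch of the distinction, using scikit-learn (which the book does not use; the numbers below are made up):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (e.g., number of fruits purchased)
X = [[1], [2], [3], [4]]          # a single predictor, e.g., household size
y_count = [2, 4, 7, 8]            # fruits purchased (numeric target)
reg = LinearRegression().fit(X, y_count)
print(reg.predict([[5]]))         # a number, e.g., around 10 fruits

# Classification: predict a categorical value (e.g., rain or no rain)
y_rain = [0, 0, 1, 1]             # 0 = no rain, 1 = rain
clf = LogisticRegression().fit(X, y_rain)
print(clf.predict([[5]]))         # a class label, 0 or 1
```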
Reinforcement Learning Task: Use the patterns in my data to make predictions, and improve these predictions as more results come in. Unlike unsupervised and supervised learning, where models are learned and then deployed without further changes, a reinforcement learning model continuously improves itself using feedback from results.
An overfitted model would yield highly accurate predictions for the current data, but would be less generalizable to future data.
An underfitted model is likely to neglect significant trends, which would cause it to yield less accurate predictions for both current and future data.
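A rough illustration of both failure modes, assuming we fit polynomials of increasing degree to noisy data (the exact numbers depend on the random noise):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy underlying curve

# Hold out every other point to stand in for "future" data
train, test = np.arange(0, 30, 2), np.arange(1, 30, 2)

for degree in (1, 3, 9):   # underfit, reasonable fit, likely overfit
    coefs = np.polyfit(x[train], y[train], degree)
    train_error = np.mean((np.polyval(coefs, x[train]) - y[train]) ** 2)
    test_error = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)
    print(degree, round(train_error, 3), round(test_error, 3))
```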
One way to keep a model’s overall complexity in check is to introduce a penalty parameter, in a step known as regularization. This new parameter penalizes any increase in a model’s complexity by artificially inflating prediction error, thus enabling the algorithm to account for both complexity and accuracy in optimizing its original parameters. By keeping a model simple, we help to maintain its generalizability.
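For example, ridge regression adds a penalty proportional to the squared coefficients; a sketch with scikit-learn, using made-up numbers and near-redundant predictors:

```python
from sklearn.linear_model import LinearRegression, Ridge

# Two nearly redundant predictors and a noisy target (toy numbers)
X = [[1, 1.1], [2, 1.9], [3, 3.2], [4, 3.9], [5, 5.1]]
y = [2.0, 4.1, 5.9, 8.1, 9.9]

plain = LinearRegression().fit(X, y)
regularized = Ridge(alpha=1.0).fit(X, y)   # alpha sets the size of the complexity penalty

print(plain.coef_)         # unpenalized coefficients, free to chase the noise
print(regularized.coef_)   # penalized coefficients are pulled toward zero, keeping the model simpler
```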
Confusion Matrix. Confusion matrices provide further insight into where our prediction model succeeded and where it failed.
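A quick example with scikit-learn's confusion_matrix (the labels are made up):

```python
from sklearn.metrics import confusion_matrix

actual    = ["rain", "rain", "no rain", "no rain", "rain", "no rain"]
predicted = ["rain", "no rain", "no rain", "rain", "rain", "no rain"]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(actual, predicted, labels=["rain", "no rain"]))
```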
Cross-validation maximizes the availability of data for validation by dividing the dataset into several segments that are used to test the model repeatedly.
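A sketch of 5-fold cross-validation with scikit-learn's bundled iris dataset (not an example from the book):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each of the 5 segments takes a turn as the test set
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores, scores.mean())
```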
This is how clustering works: by identifying common preferences or characteristics, it is possible to sort customers into groups, which retailers may then use for targeted advertising.
A scree plot shows how within-cluster scatter decreases as the number of clusters increases.
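A sketch of both ideas with k-means on synthetic data: the within-cluster scatter (inertia) printed for each k is what a scree plot would chart, and the "elbow" hints at the number of natural groups:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic "customer" data with 3 natural groups (toy data for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Within-cluster scatter (inertia) for increasing k; the drop flattens near k = 3
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(model.inertia_, 1))
```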
Principal Component Analysis (PCA) is a technique that finds the underlying variables (known as principal components) that best differentiate your data points.
PCA works best when the most informative dimensions have the largest data spread and are orthogonal to each other.
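A small sketch: two correlated measurements whose spread lies mostly along one direction, so the first principal component captures most of the information (made-up data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two correlated measurements: most of the spread lies along a single direction
height = rng.normal(170, 10, size=200)
weight = 0.9 * height + rng.normal(0, 5, size=200)
X = np.column_stack([height, weight])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # the first component captures most of the spread
```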
The Louvain method is one way to identify clusters in a network. It experiments with different clustering configurations to 1) maximize the number and strength of edges between nodes in the same cluster, and 2) minimize those between nodes in different clusters.
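A sketch using the Louvain implementation in networkx (version 2.8 or later; the toy network is made up):

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# A small social network with two tightly knit groups and one bridge edge
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),   # group 1
                  ("d", "e"), ("e", "f"), ("d", "f"),   # group 2
                  ("c", "d")])                          # bridge between the groups

communities = louvain_communities(G, seed=0)
print(communities)   # e.g., [{'a', 'b', 'c'}, {'d', 'e', 'f'}]
```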
To reduce the risk of getting stuck in a pit (a poor local optimum), we could instead use stochastic gradient descent, where rather than using every data point to adjust parameters in each iteration, we reference only one.
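A bare-bones sketch of stochastic gradient descent fitting a straight line, updating from one randomly chosen point per iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 2 + rng.normal(0, 1, size=100)   # true slope 3, intercept 2

slope, intercept, lr = 0.0, 0.0, 0.01

# Stochastic gradient descent: each update references a single, randomly chosen point
for step in range(5000):
    i = rng.integers(len(x))
    error = (slope * x[i] + intercept) - y[i]
    slope -= lr * error * x[i]
    intercept -= lr * error

print(round(slope, 2), round(intercept, 2))   # close to 3 and 2
```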
Weights of regression predictors are formally called regression coefficients. A predictor’s regression coefficient measures how strong that predictor is, in the presence of other predictors. In other words, it is the value added by that predictor, rather than its absolute predictive strength.
When there is only one predictor, the beta weight of that predictor is called a correlation coefficient, denoted as r. Correlation coefficients range from -1 to 1.
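For instance, with a single made-up predictor:

```python
import numpy as np

hours_studied = [1, 2, 3, 4, 5, 6]
exam_score    = [52, 55, 61, 64, 70, 74]

# Correlation coefficient r between the single predictor and the outcome
r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(round(r, 3))   # close to 1: a strong positive relationship
```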
k-Nearest Neighbors (k-NN) is an algorithm that classifies a data point based on the classification of its neighbors.
The k in k-NN is a parameter referring to the number of nearest neighbors to include in the majority voting process.
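A minimal k-NN example with scikit-learn (k = 5, iris dataset; not from the book):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 5: each point is classified by a majority vote among its 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))
```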
Because k-NN uses underlying patterns in the data to make predictions, prediction errors are telltale signs of data points that do not conform to overall trends. In fact, any algorithm that generates a predictive model could be used to detect anomalies in this way.
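One possible sketch of this idea: fit a predictive model, then flag points with unusually large prediction errors (the model, threshold, and data below are arbitrary choices):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2 * x.ravel() + rng.normal(0, 0.5, size=100)
y[40] += 10   # plant one anomaly that breaks the overall trend

model = KNeighborsRegressor(n_neighbors=5).fit(x, y)
errors = np.abs(model.predict(x) - y)

# Points whose prediction error is unusually large are flagged as anomalies
print(np.where(errors > errors.mean() + 3 * errors.std())[0])
```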
The main objective of SVM is to derive an optimal boundary that separates one group from another.
The SVM algorithm has one key feature, a buffer zone, which allows a limited number of training data points to cross over to the incorrect side. This results in a ‘softer’ boundary that is more robust against outliers, and hence more generalizable to new data.
While there are numerous other techniques that could do this, SVM is favored for its superior computational efficiency in deriving intricately curved patterns through a method called the kernel trick.
SVM is only able to classify two groups at a time. If there are more than two groups, SVM must be applied repeatedly to distinguish each group from the rest, a technique known as multi-class SVM.
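A sketch with scikit-learn's SVC, which handles the multi-group case by combining pairwise (two-group) SVMs behind the scenes (the parameter values here are defaults, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # three species, so more than two groups

# C controls the buffer zone: a smaller C tolerates more crossings (a softer boundary);
# the RBF kernel lets the boundary curve via the kernel trick.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

print(clf.predict(X[:5]))   # predictions across all three groups
```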
There are two methods of diversifying trees. The first chooses different combinations of binary questions at random to grow multiple trees, then aggregates the predictions from those trees; this technique is known as building a random forest (see Chapter 10). The second, instead of choosing binary questions at random, strategically selects binary questions so that prediction accuracy for each subsequent tree improves incrementally; a weighted average of predictions from all trees is then taken to obtain the result. This technique is known as gradient boosting.
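A side-by-side sketch of the two methods using scikit-learn's implementations (dataset and settings are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, random_state=0)   # trees grown at random
boosted = GradientBoostingClassifier(random_state=0)                # trees grown sequentially

print(cross_val_score(forest, X, y, cv=5).mean())
print(cross_val_score(boosted, X, y, cv=5).mean())
```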
A random forest is an ensemble of decision trees. An ensemble is the term for a prediction model generated by combining predictions from many different models, such as by majority voting or by taking averages.
Bootstrap aggregating (also known as bagging) is used to create thousands of decision trees that are adequately different from each other. To ensure minimal correlation between trees, each tree is generated from a random subset of the training data, using a random subset of predictor variables.
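A hand-rolled sketch of the idea (real random forests resample predictors at every split rather than once per tree, so this is a simplification):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

trees, feature_sets = [], []
for _ in range(25):
    rows = rng.integers(0, len(X), size=len(X))            # random subset of the data (with replacement)
    cols = rng.choice(X.shape[1], size=6, replace=False)   # random subset of predictor variables
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows]))
    feature_sets.append(cols)

# Majority vote across the diversified trees
votes = np.array([t.predict(X[:, cols]) for t, cols in zip(trees, feature_sets)])
predictions = (votes.mean(axis=0) > 0.5).astype(int)
print((predictions == y).mean())   # training accuracy of the ensemble
```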
Nonetheless, random forests are widely used because they are easy to implement. They are particularly effective in situations where the accuracy of results is more crucial than their interpretability.
To improve predictions, a convolution layer could be used. Instead of processing individual pixels, this layer identifies features made from combinations of pixels, such as the presence of a circle or an upward-pointing tail in the digit ‘6’.
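A hand-coded sketch of what one such feature detector does; in a real convolution layer the filter values are learned rather than fixed as they are here:

```python
import numpy as np
from scipy.signal import convolve2d

# A tiny 5x5 "image" containing a vertical stroke of bright pixels
image = np.zeros((5, 5))
image[:, 2] = 1

# A filter that responds to vertical edges (one of the features a convolution
# layer could learn, rather than being hand-coded as it is here)
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])

feature_map = convolve2d(image, vertical_edge, mode="valid")
print(feature_map)   # strong responses along the left and right edges of the stroke
```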
Learning the right values for weights and thresholds is essential for good activation rules that lead to accurate predictions; gradient descent (see Chapter 6.3) could be used to optimize them. A neural network’s other parameters also require tuning, such as the number of hidden layers and the number of neurons within each layer, though these are typically chosen by comparing candidate models rather than by gradient descent.
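A minimal sketch of learning the weights and threshold of a single sigmoid neuron by gradient descent (a full network repeats this across many neurons and layers):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # the rule the neuron should learn

weights, threshold, lr = np.zeros(2), 0.0, 0.5

for _ in range(1000):
    # Activation rule: output near 1 when the weighted sum exceeds the threshold
    output = 1 / (1 + np.exp(-(X @ weights - threshold)))
    error = output - y
    # Gradient descent: nudge weights and threshold to reduce the prediction error
    weights -= lr * (X.T @ error) / len(X)
    threshold -= lr * (-error.mean())

print(weights.round(2), round(threshold, 2))   # roughly equal weights, threshold near 0
```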
This approach leverages the epsilon-decreasing strategy. Epsilon refers to the proportion of time spent exploring the alternative, to check that it is indeed less effective. Since we decrease epsilon as our confidence in the better ad is reinforced, this technique belongs to a class of algorithms known as reinforcement learning.
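A toy simulation of the epsilon-decreasing strategy for choosing between two ads (the click rates and decay schedule are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
true_click_rates = [0.05, 0.08]   # ad B is genuinely better (unknown to the algorithm)
shown = np.zeros(2)
clicks = np.zeros(2)
epsilon = 0.5                     # start by exploring half the time

for step in range(10_000):
    if rng.random() < epsilon:
        ad = rng.integers(2)                                  # explore: pick an ad at random
    else:
        ad = int(np.argmax(clicks / np.maximum(shown, 1)))    # exploit: pick the best ad so far
    shown[ad] += 1
    clicks[ad] += rng.random() < true_click_rates[ad]
    epsilon = max(0.01, epsilon * 0.999)                      # decrease epsilon as confidence grows

print(shown, clicks / np.maximum(shown, 1))   # most impressions go to the better ad
```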
Slot machines have been nicknamed one-armed bandits since they appear to cheat players of money with each arm pull. Choosing which slot machine to play is thus known as a multi-armed bandit problem, a term that now refers to any problem of resource allocation, such as deciding which ad to show, which topics to revise before an exam, or which drug study to fund.

