Kindle Notes & Highlights
Classical statistics is at the heart of machine learning, and many of these algorithms are based on the same statistical equations you studied in high school. Indeed, statistical algorithms were worked through on paper well before machines ever took on the title of artificial intelligence.
Popular algorithms, such as k-means clustering, association analysis, and regression analysis, are applied in both data mining and machine learning to analyze data. But where machine learning focuses on the incremental process of self-learning and data modeling to form predictions about the future, data mining narrows in on cleaning large datasets to glean valuable insight from the past.
But before we examine specific algorithms, it is important to understand the three overarching categories of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
Examples of supervised learning algorithms include regression analysis, decision trees, k-nearest neighbors, neural networks, and support vector machines.
The k-means clustering algorithm is a popular example of unsupervised learning.
Other examples of unsupervised learning include association analysis, social network analysis, and descending dimension algorithms.
Reinforcement learning is the third and most advanced algorithm category in machine learning. Unlike supervised and unsupervised learning, reinforcement learning continuously improves its model by leveraging feedback from previous iterations.
A specific algorithmic example of reinforcement learning is Q-learning. In Q-learning, you start with a set of environment states, represented by the symbol ‘S’. In the game Pac-Man, states could be the challenges, obstacles or pathways that exist in the game. There may exist a wall to the left, a ghost to the right, and a power pill above—each representing a different state. The set of possible actions to respond to these states is referred to as “A.” In the case of Pac-Man, actions are limited to left, right, up, and down movements, as well as multiple combinations thereof. The third component is ‘Q’, the value the algorithm learns for each combination of state and action.
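As a rough sketch of these components (not taken from the book), the following shows the core Q-learning update on a toy Q-table; the number of states and actions, the reward, and the learning rate and discount factor are invented for illustration.

```python
import numpy as np

# Toy example: 4 states (S) and 4 actions (A), as in the Pac-Man analogy.
n_states, n_actions = 4, 4
Q = np.zeros((n_states, n_actions))  # Q-table: learned value of each (state, action) pair

alpha, gamma = 0.1, 0.9              # learning rate and discount factor (illustrative values)

def update_q(state, action, reward, next_state):
    """Apply the standard Q-learning update rule to one observed transition."""
    best_next = np.max(Q[next_state])  # value of the best action from the next state
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

# One hypothetical transition: moving right from state 0 lands in state 1 and earns a reward of 1.
update_q(state=0, action=2, reward=1, next_state=1)
print(Q)
```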
Next, Python users will typically import the following libraries: NumPy, Pandas, and Scikit-learn. NumPy is a free and open-source library that allows you to efficiently load and work with large datasets, including managing matrices. Scikit-learn provides access to a range of popular algorithms, including linear regression, Bayes’ classifier, and support vector machines. Finally, Pandas enables your data to be represented on a virtual spreadsheet that you can control through code. It shares many of the same features as Microsoft Excel in that it allows you to edit data and perform calculations.
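To make this concrete, a minimal first cell might look like the sketch below; the filename my_dataset.csv is a placeholder rather than a dataset from the book.

```python
import numpy as np                                  # efficient arrays and matrix operations
import pandas as pd                                 # spreadsheet-like DataFrames controlled through code
from sklearn.linear_model import LinearRegression  # one of many algorithms available in Scikit-learn

# Load a dataset into a DataFrame (the filename here is hypothetical).
df = pd.read_csv('my_dataset.csv')
print(df.head())  # inspect the first five rows, much like peeking at a spreadsheet
```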
Beginners will typically start off by using simple supervised learning algorithms such as linear regression, logistic regression, decision trees, and k-nearest neighbors. Beginners are also likely to apply unsupervised learning in the form of k-means clustering and descending dimension algorithms.
To analyze large datasets and respond to complicated prediction tasks, advanced learners work with a plethora of algorithms including Markov models, support vector machines, and Q-learning, as well as combinations of algorithms to create a unified model, known as ensemble modeling (which we will explore further in Chapter 12). But the algorithm family they’re most likely to work with is artificial neural networks (introduced in Chapter 10), which comes with its own selection of advanced machine learning libraries.
Popular alternative neural network libraries include Torch, Caffe, and the fast-growing Keras. Written in Python, Keras is an open-source deep learning library that runs on top of TensorFlow, Theano, and other frameworks, and allows users to perform fast experimentation in fewer lines of code.
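As a rough sketch of what "fewer lines of code" looks like in practice (not taken from the book), the following builds a tiny feed-forward network in Keras; the layer sizes and the ten-feature input shape are arbitrary choices for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A minimal feed-forward network; all sizes are illustrative only.
model = keras.Sequential([
    keras.Input(shape=(10,)),                # 10 input features (assumed)
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid'),   # binary classification output
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
# model.fit(X_train, y_train, epochs=10) would train it once data is available.
```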
One means to convert text-based features into numerical values is through one-hot encoding, which transforms features into binary form, represented as “1” or “0”—“True” or “False.”
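A minimal illustration with Pandas, using a made-up "color" feature, shows how one text column becomes several binary columns:

```python
import pandas as pd

# A small hypothetical dataset with one text-based feature.
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# One-hot encoding expands 'color' into one binary (1/0) column per category.
encoded = pd.get_dummies(df, columns=['color'], dtype=int)
print(encoded)
```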
Binning is another method of feature engineering that is used to convert numerical values into a category.
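For example, a hypothetical column of ages could be binned into named categories with pd.cut; the cut-off points and labels below are arbitrary:

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 41, 68])  # hypothetical numerical values

# Binning assigns each numerical value to a category defined by a range.
age_groups = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                    labels=['child', 'young adult', 'adult', 'senior'])
print(age_groups)
```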
A common approach to analyzing prediction accuracy is a measure called mean absolute error, which compares each prediction in the model to the actual value and reports the average size of those errors.
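A quick worked example with Scikit-learn, using invented actual and predicted values:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical actual values and model predictions.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# MAE averages the absolute differences between predictions and actual values.
print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0 + 1.5 + 1.0) / 4 = 0.75
```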
Cross validation maximizes the availability of training data by splitting data into various combinations and testing each specific combination. Cross validation can be performed through two primary methods. The first method is exhaustive cross validation, which involves finding and testing all possible combinations to divide the original sample into a training set and a test set. The alternative and more common method is non-exhaustive cross validation, known as k-fold validation. The k-fold validation technique involves splitting data into k assigned buckets and reserving one of those buckets as the test set at each round, with the remaining buckets used for training.
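A minimal sketch of five-fold validation with Scikit-learn, using the built-in iris dataset as a stand-in for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# k-fold validation: the data is split into 5 buckets; each bucket takes one turn as the test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```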
At a minimum, a machine learning model should typically have ten times as many data points as the total number of features.
The first regression analysis technique that we will examine is linear regression, which uses a straight line to describe a dataset.
The technical term for the regression line is the hyperplane.
Another important feature of regression is the slope, which can be conveniently calculated by referencing the hyperplane: as one variable increases, the other variable changes on average by the amount indicated by the hyperplane's angle. The slope is therefore very useful in formulating predictions.
Deviation refers to the distance between the hyperplane and the data point.
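A short sketch with Scikit-learn, using a made-up five-point dataset, shows how the fitted hyperplane's slope and intercept are then used to formulate a prediction:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny hypothetical dataset: one independent variable and one dependent variable.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(X, y)
print(model.coef_[0])        # the slope of the hyperplane (regression line)
print(model.intercept_)      # where the line crosses the y-axis
print(model.predict([[6]]))  # use the fitted line to predict a new data point
```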
Logistic regression adopts the sigmoid function to analyze data and predict discrete classes that exist in a dataset. Although logistic regression shares a visual resemblance to linear regression, it is technically a classification technique. Whereas linear regression addresses numerical equations and forms numerical predictions to discern relationships between variables, logistic regression predicts discrete classes.
Logistic regression with more than two outcome values is known as multinomial logistic regression.
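As a brief sketch, the iris dataset stands in for your own data here; because it has three classes, the fitted model is multinomial.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# The iris dataset has three classes, so this is multinomial logistic regression.
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict(X[:5]))        # discrete class predictions
print(clf.predict_proba(X[:1]))  # the probabilities behind those predictions
```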
As an advanced category of regression, support vector machine (SVM) resembles logistic regression but with stricter conditions. To that end, SVM is superior at drawing classification boundary lines.
The margin is a key feature of SVM and is important because it offers additional support to cope with new data points that may infringe on a logistic regression hyperplane.
Another useful application of SVM is mitigating anomalies. A limitation of standard logistic regression is that it goes out of its way to fit anomalies; SVM, with its margin, is less sensitive to these outlying data points.
SVM has numerous variations available to classify high-dimensional data, known as “kernels,” including linear SVC (seen in Figure 12), polynomial SVC, and the Kernel Trick. The Kernel Trick is an advanced solution to map data from a low-dimensional to a high-dimensional space.
In other words, the kernel trick lets you use linear classification techniques to produce a classification that has nonlinear characteristics; a 3D plane forms a linear separator between data points in a 3D space but will form a nonlinear separator between those points when projected into a 2D space.
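The following sketch contrasts a linear kernel with the RBF kernel on a synthetic dataset that no straight line can separate; the dataset and parameters are illustrative only, not examples from the book.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# A dataset that cannot be separated by a straight line (two interleaving half-moons).
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)  # straight-line boundary
rbf_svm = SVC(kernel='rbf').fit(X, y)        # kernel trick: nonlinear boundary in the original space

print('linear accuracy:', linear_svm.score(X, y))
print('rbf accuracy:', rbf_svm.score(X, y))
```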
Clustering analysis falls under the banner of both supervised learning and unsupervised learning. As a supervised learning technique, clustering is used to classify new data points into existing clusters through k-nearest neighbors (k-NN) and as an unsupervised learning technique, clustering is applied to identify discrete groups of data points through k-means clustering. Although there are other forms of clustering techniques, these two algorithms are generally the most popular in both machine learning and data mining.
Although generally a highly accurate and simple technique to learn, storing an entire dataset and calculating the distance between each new data point and all existing data points does place a heavy burden on computing resources. Thus, k-NN is generally not recommended for use with large datasets.
Reducing the total number of dimensions, through a descending dimension algorithm such as Principal Component Analysis (PCA) or merging variables, is a common strategy to simplify and prepare a dataset for k-NN analysis.
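A minimal sketch combining the two steps, with the iris dataset standing in for a larger one and k set arbitrarily to 5:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Reduce the four original dimensions to two with PCA, then classify with k-NN (k=5).
knn = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
knn.fit(X, y)
print(knn.predict(X[:3]))  # classify data points by their nearest neighbors in the reduced space
```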
As a popular unsupervised learning algorithm, k-means clustering attempts to divide data into k discrete groups and is effective at uncovering basic data patterns.
In order to optimize k, you may wish to turn to a scree plot for guidance. A scree plot charts the degree of scattering (variance) inside the clusters as the total number of clusters increases. Scree plots are known for their iconic “elbow,” the pronounced kink in the plot’s curve where adding more clusters stops paying off. A scree plot compares the Sum of Squared Error (SSE) for each choice of total clusters. SSE is measured as the sum of the squared distances between each data point in a cluster and that cluster’s centroid. In a nutshell, SSE drops as more clusters are formed.
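A short sketch of how the numbers behind a scree plot can be generated: the synthetic dataset below deliberately contains four clusters, so the drop in SSE (which Scikit-learn exposes as inertia_) flattens after k = 4.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # synthetic data with 4 true clusters

# Compute SSE for k = 1..8 and look for the elbow.
for k in range(1, 9):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))  # SSE drops as more clusters are formed; the drop flattens after k=4
```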
Bias refers to the gap between your predicted value and the actual value. In the case of high bias, your predictions are likely to be skewed in a certain direction away from the actual values. Variance describes how scattered your predicted values are.
An overfitted model will yield accurate predictions from the training data but prove less accurate at formulating predictions from the test data.
Underfitting is when your model is overly simple and has not scratched the surface of the underlying patterns in the dataset.
Another effective strategy to combat overfitting and underfitting is to introduce regularization. Regularization artificially amplifies bias error by penalizing an increase in a model’s complexity.
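A rough sketch of this idea using ridge regression in Scikit-learn; the noisy dataset and the penalty strength alpha are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))                     # small, noisy hypothetical dataset
y = X[:, 0] * 2 + rng.normal(scale=0.5, size=20)  # only the first feature actually matters

plain = LinearRegression().fit(X, y)
regularized = Ridge(alpha=1.0).fit(X, y)          # alpha controls the complexity penalty

# The penalty shrinks the coefficients, trading a little bias for less variance (less overfitting).
print(np.abs(plain.coef_).sum(), np.abs(regularized.coef_).sum())
```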
Artificial neural networks, also known as neural networks, are a popular machine learning technique for processing data through layers of analysis. The naming of artificial neural networks was inspired by the algorithm’s resemblance to the human brain.
Similar to neurons in the human brain, artificial neural networks are formed by interconnected neurons, also called nodes, which interact with each other through axons, called edges. In a neural network, the nodes are stacked up in layers and generally start with a broad base. The first layer consists of raw data such as numeric values, text, images or sound, which are divided into nodes. Each node then sends information to the next layer of nodes through the network’s edges.
To train the network through supervised learning, the model’s predicted output is compared to the actual output (that is known to be correct) and the difference between these two results is measured and is known as the cost or cost value. The purpose of training is to reduce the cost value until the model’s prediction closely matches the correct output. This is achieved by incrementally tweaking the network’s weights until the lowest possible cost value is obtained. This process of training the neural network is called back-propagation.
The most basic form of a feed-forward neural network is the perceptron.
An alternative to the perceptron is the sigmoid neuron. A sigmoid neuron is very similar to a perceptron, but instead of producing a binary output, its sigmoid function outputs any value between 0 and 1.
While more flexible than a perceptron, a sigmoid neuron cannot generate negative values. Hence, a third option is the hyperbolic tangent function.
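To make the contrast concrete, the sketch below (not from the book) evaluates the three options on the same hypothetical weighted inputs:

```python
import numpy as np

def perceptron(z):
    return np.where(z >= 0, 1, 0)  # binary step: output is only 0 or 1

def sigmoid(z):
    return 1 / (1 + np.exp(-z))    # any value between 0 and 1

def tanh(z):
    return np.tanh(z)              # any value between -1 and 1, so negative outputs are possible

z = np.array([-2.0, 0.0, 2.0])     # example weighted inputs
print(perceptron(z), sigmoid(z), tanh(z))
```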
What makes deep learning “deep” is the stacking of at least 5-10 node layers, with advanced object recognition using upwards of 150 layers. Object recognition, as used by self-driving vehicles to recognize objects such as pedestrians and other vehicles, is a popular application of deep learning today. Other common applications of deep learning include time series analysis to analyze data trends measured over particular time periods or intervals, speech recognition, and text processing tasks including sentiment analysis, topic segmentation, and named entity recognition.
As a supervised learning technique, decision trees are used primarily for solving classification problems, but they can be applied to solve regression problems too.
Classification trees can use quantitative and categorical data to model categorical outcomes. Regression trees also use quantitative and categorical data but instead model quantitative outcomes.
Entropy is a mathematical term that measures how mixed the different classes are in the data at a given node. In simple terms, we want the data at each layer to be more homogenous than at the last. We thus want to pick a “greedy” algorithm that can reduce the level of entropy at each layer of the tree.
Whether it is ID3 or another algorithm, this process of splitting data into binary partitions, known as recursive partitioning, is repeated until a stopping criterion is met.
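As a quick worked example (not from the book), the following computes entropy for a two-class node under different mixes; a split that lowers this number produces more homogenous layers.

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy of a node, given the count of each class at that node."""
    p = np.array(class_counts) / np.sum(class_counts)
    p = p[p > 0]                       # ignore empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))

print(entropy([10, 10]))  # 1.0   -> maximally mixed node (high entropy)
print(entropy([18, 2]))   # ~0.47 -> mostly one class (lower entropy)
print(entropy([20, 0]))   # 0.0   -> perfectly homogenous node
```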
Rather than striving for the most efficient split at each round of recursive partitioning, an alternative technique is to construct multiple trees and combine their predictions to select an optimal path of classification or prediction. This involves a randomized selection of binary questions to grow multiple different decision trees, known as random forests. In the industry, you will also often hear people refer to this process as “bootstrap aggregating” or “bagging.”
There’s little use in compiling five or ten identical models—there needs to be some element of variation. This is why bootstrap sampling draws on the same dataset but extracts a different variation of the data at each turn.
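A minimal sketch comparing a single decision tree with a 100-tree random forest in Scikit-learn, with the iris dataset as a stand-in for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# A random forest grows 100 trees, each from a bootstrap sample of the data and a
# random subset of features, then combines (bags) their votes into one prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(single_tree.predict(X[:3]), forest.predict(X[:3]))
```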