Machine Learning For Absolute Beginners: A Plain English Introduction (Second Edition) (Learn AI & Python for Beginners)
Classical statistics is at the heart of machine learning and many of these algorithms are based on the same statistical equations you studied in high school. Indeed, statistical algorithms were conducted on paper well before machines ever took on the title of artificial intelligence.
Popular algorithms, such as k-means clustering, association analysis, and regression analysis, are applied in both data mining and machine learning to analyze data. But where machine learning focuses on the incremental process of self-learning and data modeling to form predictions about the future, data mining narrows in on cleaning large datasets to glean valuable insight from the past.
But before we examine specific algorithms, it is important to understand the three overarching categories of machine learning. These three categories are supervised, unsupervised, and reinforcement.
Examples of supervised learning algorithms include regression analysis, decision trees, k-nearest neighbors, neural networks, and support vector machines.
The k-means clustering algorithm is a popular example of unsupervised learning.
Other examples of unsupervised learning include association analysis, social network analysis, and descending dimension algorithms.
Reinforcement learning is the third and most advanced algorithm category in machine learning. Unlike supervised and unsupervised learning, reinforcement learning continuously improves its model by leveraging feedback from previous iterations.
A specific algorithmic example of reinforcement learning is Q-learning. In Q-learning, you start with a set environment of states, represented by the symbol ‘S’. In the game Pac-Man, states could be the challenges, obstacles or pathways that exist in the game. There may exist a wall to the left, a ghost to the right, and a power pill above—each representing different states. The set of possible actions to respond to these states is referred to as “A.” In the case of Pac-Man, actions are limited to left, right, up, and down movements, as well as multiple combinations thereof. The third …
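As a rough sketch of how these pieces fit together, the toy example below implements the Q-learning update rule on an invented five-state corridor; the states, actions, and reward values are made up for illustration and are not the book's Pac-Man example.

```python
import numpy as np

# Toy Q-learning sketch: 5 states in a row, two actions (0 = left, 1 = right),
# and a reward of 1 for reaching the rightmost state. Values are illustrative.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))  # Q-table: expected reward per state-action
alpha, gamma = 0.1, 0.9              # learning rate and discount factor

for episode in range(200):
    s = 0                                            # start at the leftmost state
    while s != n_states - 1:
        a = np.random.randint(n_actions)             # explore randomly
        s_next = max(0, s - 1) if a == 0 else s + 1  # apply the action
        r = 1 if s_next == n_states - 1 else 0       # reward on reaching the goal
        # Q-learning update: move Q(s, a) toward r + gamma * max Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))  # the learned values favor moving right in every state
```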
Next, Python users will typically import the following libraries: NumPy, Pandas, and Scikit-learn. NumPy is a free and open-source library that allows you to efficiently load and work with large datasets, including managing matrices. Scikit-learn provides access to a range of popular algorithms, including linear regression, Bayes’ classifier, and support vector machines. Finally, Pandas enables your data to be represented on a virtual spreadsheet that you can control through code. It shares many of the same features as Microsoft Excel in that it allows you to edit data and perform …
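A minimal sketch of what those three imports look like in practice; the sample data below is invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

# NumPy: efficient numerical arrays and matrices
matrix = np.array([[1, 2], [3, 4]])

# Pandas: a virtual spreadsheet (DataFrame) controlled through code
df = pd.DataFrame({"age": [23, 35, 47], "income": [40, 55, 72]})
df["income_usd"] = df["income"] * 1000  # edit data much like in Excel

# Scikit-learn: ready-made algorithms such as regression and SVM
print(LinearRegression(), SVC())
```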
Beginners will typically start off by using simple supervised learning algorithms such as linear regression, logistic regression, decision trees, and k-nearest neighbors. Beginners are also likely to apply unsupervised learning in the form of k-means clustering and descending dimension algorithms.
To analyze large datasets and respond to complicated prediction tasks, advanced learners work with a plethora of algorithms including Markov models, support vector machines, and Q-learning, as well as combinations of algorithms to create a unified model, known as ensemble modeling (which we will explore further in Chapter 12). But the algorithm family they’re most likely to work with is artificial neural networks (introduced in Chapter 10), which comes with its own selection of advanced machine learning libraries.
Popular alternative neural network libraries include Torch, Caffe, and the fast-growing Keras. Written in Python, Keras is an open-source deep learning library that runs on top of TensorFlow, Theano, and other frameworks, and allows users to perform fast experimentation in fewer lines of code.
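As a hedged illustration of the "fewer lines of code" point, here is a minimal Keras model; the layer sizes, activations, and input shape are arbitrary choices for the sketch, not values from the book.

```python
# A small feed-forward network defined in a few lines of Keras.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(4,)),                      # four input features (arbitrary)
    keras.layers.Dense(8, activation="relu"),     # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```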
One means to convert text-based features into numerical values is through one-hot encoding, which transforms features into binary form, represented as “1” or “0”—“True” or “False.”
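A quick sketch of one-hot encoding with pandas; the color column is made up.

```python
import pandas as pd

# One-hot encoding: each text category becomes its own 1/0 (True/False) column.
df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
print(pd.get_dummies(df, columns=["color"]))
# Produces color_blue, color_green, and color_red columns in binary form
```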
Binning is another method of feature engineering that is used to convert numerical values into a category.
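A short sketch of binning with pandas; the age values, bin boundaries, and labels are invented.

```python
import pandas as pd

# Binning: continuous ages grouped into labelled categories.
ages = pd.Series([12, 25, 41, 67])
bins = pd.cut(ages, bins=[0, 18, 40, 100],
              labels=["child", "adult", "senior"])
print(bins)  # child, adult, senior, senior
```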
A common approach to analyzing prediction accuracy is a measure called mean absolute error, which compares each of the model’s predictions to the actual values and averages the absolute differences into a single error score.
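For instance, with made-up actual and predicted values:

```python
from sklearn.metrics import mean_absolute_error

# Mean absolute error: average of |predicted - actual| across all predictions.
actual = [3.0, 5.0, 8.0]
predicted = [2.5, 5.5, 9.0]
print(mean_absolute_error(actual, predicted))  # (0.5 + 0.5 + 1.0) / 3 = 0.667
```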
Cross validation maximizes the availability of training data by splitting data into various combinations and testing each specific combination. Cross validation can be performed through two primary methods. The first method is exhaustive cross validation, which involves finding and testing all possible combinations to divide the original sample into a training set and a test set. The alternative and more common method is non-exhaustive cross validation, known as k-fold validation. The k-fold validation technique involves splitting data into k assigned buckets and reserving one of those buckets …
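A minimal k-fold sketch with Scikit-learn; the choice of dataset, classifier, and k = 5 folds is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# k-fold validation: the data is split into 5 buckets, and each bucket takes
# one turn as the test set while the other four train the model.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores, scores.mean())  # one accuracy score per fold, plus the average
```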
At a minimum, a machine learning model should typically have ten times as many data points as the total number of features.
The first regression analysis technique that we will examine is linear regression, which uses a straight line to describe a dataset.
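A minimal sketch with Scikit-learn, using invented points that lie on the line y = 2x + 1; the fitted coefficient and intercept describe the regression line discussed in the next highlights.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a straight line to made-up data; coef_ is the slope and intercept_
# is where the line crosses the y-axis.
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])            # y = 2x + 1, so the slope should be ~2
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # [2.] 1.0
```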
The technical term for the regression line is the hyperplane.
Another important feature of regression is the slope, which can be conveniently calculated by referencing the hyperplane: as one variable increases, the other variable increases, on average, at the rate denoted by the hyperplane. The slope is therefore very useful in formulating predictions.
Deviation refers to the distance between the hyperplane and the data point.
Logistic regression adopts the sigmoid function to analyze data and predict discrete classes that exist in a dataset. Although logistic regression shares a visual resemblance to linear regression, it is technically a classification technique. Whereas linear regression addresses numerical equations and forms numerical predictions to discern relationships between variables, logistic regression predicts discrete classes.
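A small sketch with invented pass/fail data; note that predict returns a discrete class while predict_proba exposes the underlying sigmoid probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Predict a discrete class (0 = fail, 1 = pass) from hours studied.
X = np.array([[1], [2], [3], [8], [9], [10]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)
print(model.predict([[4]]))        # a discrete class, 0 or 1
print(model.predict_proba([[4]]))  # sigmoid output: probability of each class
```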
Logistic regression with more than two outcome values is known as multinomial logistic regression.
As an advanced category of regression, support vector machine (SVM) resembles logistic regression but with stricter conditions. To that end, SVM is superior at drawing classification boundary lines.
The margin is a key feature of SVM and is important because it offers additional support to cope with new data points that may infringe on a logistic regression hyperplane.
Another useful application of SVM is mitigating anomalies. A limitation of standard logistic regression is that it goes out of its way to fit anomalies.
SVM has numerous variations available to classify high-dimensional data, known as “kernels,” including linear SVC (seen in Figure 12), polynomial SVC, and the Kernel Trick. The Kernel Trick is an advanced solution to map data from a low-dimensional to a high-dimensional space.
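A brief sketch comparing kernels on deliberately non-linear data; the dataset and parameters are illustrative choices, not the book's Figure 12 example.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Linearly inseparable data (one circle inside another); the RBF kernel maps
# it to a higher-dimensional space where a linear boundary can separate it.
X, y = make_circles(noise=0.1, factor=0.4, random_state=0)
for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.score(X, y))  # the rbf kernel should score highest here
```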
In other words, the kernel trick lets you use linear classification techniques to produce a classification that has nonlinear characteristics; a 3D plane forms a linear separator between data points in a 3D space but will form a nonlinear separator between those points when projected into a 2D space.
Clustering analysis falls under the banner of both supervised learning and unsupervised learning. As a supervised learning technique, clustering is used to classify new data points into existing clusters through k-nearest neighbors (k-NN) and as an unsupervised learning technique, clustering is applied to identify discrete groups of data points through k-means clustering. Although there are other forms of clustering techniques, these two algorithms are generally the most popular in both machine learning and data mining.
Although k-NN is generally a highly accurate and simple technique to learn, storing an entire dataset and calculating the distance between each new data point and all existing data points places a heavy burden on computing resources. Thus, k-NN is generally not recommended for use with large datasets.
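A minimal k-NN sketch; note that fit essentially stores the training set, which is why large datasets become expensive.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# k-NN classifies a new point by the majority class of its k nearest neighbors.
X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # stores the whole dataset
print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))           # class of a new flower
```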
Reducing the total number of dimensions, through a descending dimension algorithm such as Principal Component Analysis (PCA) or by merging variables, is a common strategy to simplify and prepare a dataset for k-NN analysis.
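For example, a PCA sketch reducing the four iris features to two; the choice of two components is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# PCA compresses the four iris features into two principal components,
# shrinking the distance calculations that k-NN must perform.
X, _ = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_2d.shape)  # (150, 4) -> (150, 2)
```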
As a popular unsupervised learning algorithm, k-means clustering attempts to divide data into k discrete groups and is effective at uncovering basic data patterns.
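A short sketch with four invented 2-D points split into k = 2 clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Divide made-up 2-D points into k = 2 discrete groups.
X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9.5]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the two centroids
```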
In order to optimize k, you may wish to turn to a scree plot for guidance. A scree plot charts the degree of scattering (variance) inside the clusters as the total number of clusters increases, comparing the Sum of Squared Error (SSE) for each choice of total clusters. SSE is measured as the sum of the squared distances between each data point inside a cluster and that cluster’s centroid. In a nutshell, SSE drops as more clusters are formed, and scree plots are famous for their iconic “elbow,” the pronounced kink in the plot’s curve where adding further clusters stops producing a significant drop in SSE.
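A sketch of the numbers behind a scree plot, using Scikit-learn's inertia_ attribute as the SSE for each k; the synthetic dataset is built with three true clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# SSE for k = 1..6: it always drops as k grows, but the drop flattens
# sharply at the "elbow", here around the true 3 clusters.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
for k in range(1, 7):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))
```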
Bias refers to the gap between your predicted value and the actual value. In the case of high bias, your predictions are likely to be skewed in a certain direction away from the actual values. Variance describes how scattered your predicted values are.
An overfitted model will yield accurate predictions from the training data but prove less accurate at formulating predictions from the test data.
Underfitting is when your model is overly simple and has not scratched the surface of the underlying patterns in the dataset.
Another effective strategy to combat overfitting and underfitting is to introduce regularization. Regularization artificially amplifies bias error by penalizing an increase in a model’s complexity.
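A small sketch contrasting plain linear regression with ridge regression, one common form of regularization; the data and the alpha value are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Ridge regression penalizes large weights: a higher alpha means a stronger
# penalty on complexity (more bias, less variance).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=20)     # only feature 0 matters
print(LinearRegression().fit(X, y).coef_.round(2))
print(Ridge(alpha=10).fit(X, y).coef_.round(2))  # coefficients shrink toward 0
```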
Artificial neural networks, also known as neural networks, are a popular machine learning technique for processing data through layers of analysis. The naming of artificial neural networks was inspired by the algorithm’s resemblance to the human brain.
Similar to neurons in the human brain, artificial neural networks are formed by interconnected neurons, also called nodes, which interact with each other through axons, called edges. In a neural network, the nodes are stacked up in layers and generally start with a broad base. The first layer consists of raw data such as numeric values, text, images or sound, which are divided into nodes. Each node then sends information to the next layer of nodes through the network’s edges.
To train the network through supervised learning, the model’s predicted output is compared to the actual output (that is known to be correct) and the difference between these two results is measured and is known as the cost or cost value. The purpose of training is to reduce the cost value until the model’s prediction closely matches the correct output. This is achieved by incrementally tweaking the network’s weights until the lowest possible cost value is obtained. This process of training the neural network is called back-propagation.
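A stripped-down sketch of that training loop with a single weight; real back-propagation applies the same idea across every weight in every layer, and all values here are invented.

```python
# One weight, a squared cost, and repeated small tweaks to reduce that cost.
x, target = 2.0, 10.0  # input and known-correct output (invented values)
w = 0.5                # starting weight
lr = 0.05              # learning rate: size of each incremental tweak

for step in range(50):
    prediction = w * x
    cost = (prediction - target) ** 2         # gap between predicted and actual
    gradient = 2 * (prediction - target) * x  # direction that increases cost
    w -= lr * gradient                        # tweak the weight the other way

print(w, cost)  # w approaches 5.0, so the prediction w * x approaches 10.0
```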
The most basic form of a feed-forward neural network is the perceptron.
An alternative to the perceptron is the sigmoid neuron. A sigmoid neuron is very similar to a perceptron, but rather than producing a binary output, its sigmoid function outputs any value between 0 and 1.
While more flexible than a perceptron, a sigmoid neuron cannot generate negative values. Hence, a third option is the hyperbolic tangent function.
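A side-by-side sketch of the three unit types on the same inputs; the input values are arbitrary.

```python
import numpy as np

# Perceptron output is binary, sigmoid lies in (0, 1), tanh in (-1, 1).
z = np.array([-2.0, 0.0, 2.0])
step = (z > 0).astype(int)      # perceptron: 0 or 1
sigmoid = 1 / (1 + np.exp(-z))  # sigmoid neuron: between 0 and 1
tanh = np.tanh(z)               # hyperbolic tangent: allows negative values
print(step, sigmoid.round(2), tanh.round(2))
```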
What makes deep learning “deep” is the stacking of at least 5-10 node layers, with advanced object recognition using upwards of 150 layers. Object recognition, as used by self-driving vehicles to recognize objects such as pedestrians and other vehicles, is a popular application of deep learning today. Other common applications of deep learning include time series analysis to analyze data trends measured over particular time periods or intervals, speech recognition, and text processing tasks including sentiment analysis, topic segmentation, and named entity recognition.
As a supervised learning technique, decision trees are used primarily for solving classification problems, but they can be applied to solve regression problems too.
Classification trees can use quantitative and categorical data to model categorical outcomes. Regression trees also use quantitative and categorical data but instead model quantitative outcomes.
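A minimal classification-tree sketch; the depth limit is an arbitrary choice, and export_text prints the binary questions the tree asks at each split.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# A classification tree: quantitative inputs, a categorical outcome.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))  # the question asked at each split of the tree
```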
Entropy is a mathematical measure of the variance in the data among different classes. In simple terms, we want the data at each layer to be more homogeneous than at the last. We thus want to pick a “greedy” algorithm that can reduce the level of entropy at each layer of the tree.
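A sketch of the entropy calculation itself, using the standard Shannon formula (which this highlight does not spell out).

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy: 0 when a node holds one class only, and 1 (for two
    classes) when the classes are evenly mixed."""
    p = np.array(class_counts) / sum(class_counts)
    p = p[p > 0]                                  # log2(0) is undefined
    return float(-(p * np.log2(p)).sum() + 0.0)   # + 0.0 normalizes -0.0

print(entropy([10, 10]))  # 1.0: maximally mixed
print(entropy([20, 0]))   # 0.0: perfectly homogeneous
```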
Whether it is ID3 or another algorithm, this process of splitting data into binary partitions, known as recursive partitioning, is repeated until a stopping criterion is met.
Rather than striving for the most efficient split at each round of recursive partitioning, an alternative technique is to construct multiple trees and combine their predictions to select an optimal path of classification or prediction. This involves a randomized selection of binary questions to grow multiple different decision trees, known as random forests. In the industry, you will also often hear people refer to this process as “bootstrap aggregating” or “bagging.”
There’s little use in compiling five or ten identical models—there needs to be some element of variation. This is why bootstrap sampling draws on the same dataset but extracts a different variation of the data at each turn.
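A short random forest sketch; n_estimators = 10 is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# A random forest: many trees, each grown on a bootstrap sample of the data,
# with randomized splits so the trees vary rather than being identical.
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(len(forest.estimators_))                 # 10 differently-grown trees
print(forest.predict([[5.1, 3.5, 1.4, 0.2]]))  # their combined prediction
```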