Utilize this easy-to-follow beginner's guide to understand how deep learning can be applied to the task of anomaly detection. Using Keras and PyTorch in Python, the book focuses on how various deep learning models can be applied to semi-supervised and unsupervised anomaly detection tasks.

This book begins with an explanation of what anomaly detection is, what it is used for, and its importance. After covering statistical and traditional machine learning methods for anomaly detection using Scikit-Learn in Python, the book provides an introduction to deep learning, with details on how to build and train a deep learning model in both Keras and PyTorch, before shifting the focus to applications of the following deep learning models to anomaly detection: various types of Autoencoders, Restricted Boltzmann Machines, RNNs & LSTMs, and Temporal Convolutional Networks. The book explores unsupervised and semi-supervised anomaly detection along with the basics of time series-based anomaly detection.

By the end of the book you will have a thorough understanding of the basic task of anomaly detection as well as an assortment of methods to approach it, ranging from traditional methods to deep learning. Additionally, you are introduced to Scikit-Learn and are able to create deep learning models in Keras and PyTorch.

What You Will Learn
- Understand what anomaly detection is and why it is important in today's world
- Become familiar with statistical and traditional machine learning approaches to anomaly detection using Scikit-Learn
- Know the basics of deep learning in Python using Keras and PyTorch
- Be aware of basic data science concepts for measuring a model's performance: understand what AUC is, what precision and recall mean, and more
- Apply deep learning to semi-supervised and unsupervised anomaly detection

Who This Book Is For
Data scientists and machine learning engineers interested in learning the basics of deep learning applications in anomaly detection
A very good introduction to ML, DL, and anomaly detection, but with the original sin of poor pagination and even poorer graphic design. All in all it does its job in explaining how to deal with anomaly detection, but I'd have liked a few more unsupervised examples, which are the toughest situations to deal with.
The Deep Learning section is very well written; the authors start from the basics, from the artificial neuron up to the state of the art such as CNNs and GPT.
NOTES

Data-based Anomaly Detection
- Statistical Methods: These methods rely on statistical measures such as mean, standard deviation, or probability distributions to identify anomalies. Examples include z-score, interquartile range (IQR), and Gaussian distribution modeling (a z-score sketch follows below).
- Machine Learning Algorithms: Various machine learning algorithms learn patterns from the data and detect anomalies based on deviations from learned patterns. Techniques like decision trees, support vector machines (SVM), isolation forests, and autoencoders fall into this category.
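As a minimal sketch of the statistical approach, a z-score detector might look like the following (the threshold and the toy response-time values are illustrative assumptions, not taken from the book):

```python
import numpy as np

def zscore_anomalies(x, threshold=2.0):
    """Flag points whose z-score magnitude exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

# Example: one obvious outlier among server response times (ms)
response_times = [100, 102, 98, 101, 99, 500]
print(zscore_anomalies(response_times))  # only the 500 ms point exceeds the threshold
```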
Context-based Anomaly Detection
- Domain Knowledge: Context-based approaches leverage domain-specific knowledge to identify anomalies. For example, in network security, unusual network traffic patterns may be detected based on knowledge of typical network behavior.
- Expert Systems: Expert systems use rule-based or knowledge-based systems to detect anomalies based on predefined rules or heuristics derived from domain expertise.
Pattern-based Anomaly Detection
- Pattern Recognition: Pattern-based approaches focus on identifying deviations from expected patterns within the data. Techniques such as time series analysis, sequence mining, and clustering fall into this category.
- Deep Learning: Deep learning techniques, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and generative adversarial networks (GANs), can be used for pattern-based anomaly detection by learning complex patterns and detecting deviations from learned representations.
Outlier detection focuses on identifying data points that deviate significantly from the majority of the dataset. These data points are often called outliers. Outliers can be indicative of errors, anomalies, or rare events in the data. Techniques such as statistical methods (e.g., z-score, IQR), machine learning algorithms (e.g., isolation forests, one-class SVM), and clustering can be used for outlier detection.
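To complement the z-score sketch above, an IQR-based outlier check might look like this (the conventional 1.5×IQR multiplier and the toy data are assumptions made for illustration):

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

data = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(data))  # the value 95 falls outside the fences
```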
Novelty detection, also known as one-class classification, involves identifying instances that significantly differ from normal data, without having access to examples of anomalies during training. The goal is to detect novel or unseen patterns in the data. It's particularly useful when anomalies are rare and difficult to obtain labeled data for. Techniques such as support vector machines (SVM) and autoencoders are commonly used for novelty detection.
Event detection aims to identify significant occurrences or events in a dataset, often in real-time or near real-time. These events may represent changes, anomalies, or patterns of interest in the data stream. Event detection is crucial in various domains such as sensor networks, finance, and cybersecurity. Techniques such as time series analysis, signal processing, and machine learning algorithms can be applied for event detection.
Noise removal involves the process of filtering or eliminating unwanted or irrelevant data points from a dataset. Noise can obscure meaningful patterns and distort the analysis results. Techniques such as smoothing filters, wavelet denoising, and outlier detection can be used for noise removal, depending on the nature of the noise and the characteristics of the data.
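As a small illustration of noise removal, a simple moving-average smoothing filter can be applied to a noisy signal (the window size and the toy sine-wave signal are arbitrary choices for this sketch):

```python
import numpy as np

def moving_average(x, window=5):
    """Smooth a 1-D signal with a simple moving average."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

# Toy signal: a sine wave with added Gaussian noise
t = np.linspace(0, 2 * np.pi, 200)
noisy = np.sin(t) + 0.3 * np.random.randn(t.size)
smoothed = moving_average(noisy, window=9)
```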
Traditional ML Algorithms

Isolation Forest
Isolation Forest is an unsupervised machine learning algorithm used for anomaly detection. It works by isolating anomalies in the data, splitting them from the rest of the data using binary trees.
- Random Partitioning: Isolation Forest randomly selects a feature and then randomly selects a value within the range of that feature. It then partitions the data based on this randomly selected feature and value.
- Recursive Partitioning: This process of random partitioning is repeated recursively until all data points are isolated or a predefined maximum depth is reached.
- Anomaly Score Calculation: Anomalies are expected to be isolated with fewer partitions compared to normal data points, and are therefore assigned lower anomaly scores. These scores are based on the average path length required to isolate the data points during the partitioning process. The shorter the path, the more likely the point is an anomaly.
- Thresholding: An anomaly threshold is defined, and data points with anomaly scores below this threshold are considered anomalies.
Let's consider a simple example of anomaly detection in a dataset containing information about server response times. The dataset includes features such as CPU usage, memory usage, and network traffic. We want to identify anomalous server responses that indicate potential system failures or cyber attacks (a code sketch follows below).
- Random Partitioning: In the first iteration, the algorithm randomly selects a feature, say CPU usage, and then randomly selects a value within the range of CPU usage, for example 80%. Based on this random selection, it partitions the data into two groups: data points with CPU usage <= 80% and data points with CPU usage > 80%.
- Recursive Partitioning: This process is repeated recursively, with random feature and value selections, until each data point is isolated or the maximum depth is reached. Each partitioning step creates a binary tree structure.
- Anomaly Score Calculation: Anomalies are expected to require fewer partitions to isolate. Therefore, data points that are isolated early in the process (i.e., with shorter average path lengths) are assigned lower anomaly scores.
- Thresholding: An anomaly threshold is defined based on domain knowledge or validation data. Data points with anomaly scores below this threshold are flagged as anomalies.
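Using scikit-learn, the server-response example might be sketched as follows; the feature values and the contamination setting are illustrative assumptions, not values from the book:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Columns: CPU usage (%), memory usage (%), network traffic (MB/s)
X = np.array([
    [35, 40, 1.2], [38, 42, 1.1], [40, 45, 1.3], [37, 41, 1.2],
    [36, 43, 1.0], [39, 44, 1.4], [95, 90, 9.5],  # last row: suspicious spike
])

model = IsolationForest(n_estimators=100, contamination=0.15, random_state=42)
model.fit(X)

labels = model.predict(X)        # +1 = normal, -1 = anomaly
scores = model.score_samples(X)  # lower scores indicate more anomalous points
print(labels, scores)
```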
One-Class Support Vector Machine
One-Class Support Vector Machine (SVM) is a type of support vector machine algorithm used for anomaly detection, particularly when dealing with unlabeled data. It is trained on only the normal data instances and aims to create a decision boundary that encapsulates the normal data points, thereby distinguishing them from potential anomalies.
- Training Phase: One-Class SVM is trained using only the normal instances (i.e., data points without anomalies). The algorithm aims to find a hyperplane (decision boundary) that best separates the normal data points from the origin in the feature space. Unlike traditional SVM, which aims to maximize the margin between different classes, One-Class SVM aims to enclose as many normal data points as possible within a margin around the decision boundary.
- Model Representation: The decision boundary created by One-Class SVM is represented by a hyperplane defined by a set of support vectors and a parameter called "nu". The hyperplane divides the feature space into two regions: the region encapsulating the normal data points (inliers) and the region outside the boundary, which may contain anomalies (outliers).
- Prediction Phase: During the prediction phase, new data points are evaluated based on their proximity to the decision boundary. Data points falling within the boundary (inside the margin) are classified as normal (inliers). Data points falling outside the boundary are classified as potential anomalies (outliers). A code sketch follows after the considerations below.
- Hyperparameter Tuning: One-Class SVM has a hyperparameter called "nu" that controls the trade-off between maximizing the margin and allowing for violations (i.e., data points classified as outliers). Tuning this hyperparameter is crucial for achieving optimal performance.
- Scalability: One-Class SVM is reasonably efficient on moderately sized datasets, but training can become expensive on very large datasets, and it may become less effective in extremely high-dimensional spaces.
- Robustness to Outliers: One-Class SVM is inherently robust to outliers in the training data since it learns from only one class. However, it may still misclassify some anomalies that lie close to the decision boundary.
- Class Imbalance: One-Class SVM assumes that anomalies are rare, i.e., that the normal class makes up the vast majority of the data. If anomalies are not significantly different from normal instances, or if they form a significant portion of the data, One-Class SVM may not perform well.
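A minimal scikit-learn sketch of the workflow described above; the synthetic data, nu, and gamma values are assumptions chosen only for illustration:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))    # normal data only
X_test = np.vstack([rng.normal(size=(5, 2)),                # normal-looking points
                    np.array([[6.0, 6.0], [-7.0, 5.0]])])   # obvious outliers

# nu bounds the fraction of training errors / support vectors
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
clf.fit(X_train)

pred = clf.predict(X_test)             # +1 = inlier, -1 = outlier
dist = clf.decision_function(X_test)   # signed distance to the decision boundary
print(pred)
```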
Deep Learning
An artificial neuron, also known as a perceptron, is a fundamental building block of artificial neural networks. It mimics the behavior of biological neurons in the human brain.
The input, a vector from x₁ to xₙ, is multiplied element-wise by a weight vector w₁ to wₙ and then summed together. The sum is then offset by a bias term b, and the result passes through an activation function, which is some mathematical function that delivers an output signal based on the magnitude and sign of the input. An example is a simple step function that outputs 1 if the combined input passes a threshold, or 0 otherwise. These now form the outputs, y₁ to yₘ. This y-vector can then serve as the input to another neuron.
- Input Layer: An artificial neuron typically receives input signals from other neurons or directly from the input features of the data. Each input signal x is associated with a weight w that represents the strength of the connection between the input and the neuron.
- Weighted Sum: The neuron computes the weighted sum of the input signals and their corresponding weights.
- Bias Term: The bias term b is a constant value added to the weighted sum before applying the activation function. It allows the neuron to adjust the decision boundary independently of the input data; effectively, it shifts the activation function horizontally, influencing the threshold at which the neuron fires.
- Activation Function: The weighted sum z is then passed through an activation function f(z). It introduces non-linearity into the neuron, enabling it to model complex relationships and learn non-linear patterns in the data that a simple linear model could not capture. Common activation functions include sigmoid, tanh, ReLU, Leaky ReLU, and ELU; the choice depends on the specific requirements of the task and the characteristics of the data.
- Output: The output y of the neuron is the result of applying the activation function to the weighted sum, y = f(z). It represents the neuron's activation level or firing rate, which is then passed as input to the neurons in the subsequent layers of the network.
- Output Layer: In a neural network, neurons are organized into layers. The output layer typically consists of one or more neurons that produce the final output of the network. The activation function used in the output layer depends on the nature of the task: sigmoid or softmax functions are commonly used for binary or multi-class classification tasks, while linear functions may be used for regression tasks.
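A single artificial neuron computed directly in NumPy, following the steps above (the weights, bias, and inputs are arbitrary illustration values):

```python
import numpy as np

def neuron(x, w, b, activation):
    """Weighted sum of inputs plus bias, passed through an activation function."""
    z = np.dot(w, x) + b
    return activation(z)

step = lambda z: 1.0 if z > 0 else 0.0        # simple step activation
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # sigmoid activation

x = np.array([0.5, -1.2, 3.0])   # inputs x1..xn
w = np.array([0.8, 0.1, -0.4])   # weights w1..wn
b = 0.2                          # bias term

print(neuron(x, w, b, step), neuron(x, w, b, sigmoid))
```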
Activation functions are a way to map the input signals into some form of output signal to be interpreted by the subsequent neurons. They are designed to add non-linearity; if no activation function is used, the output of the affine transformation is simply the neuron's final output.
- Sigmoid: The sigmoid activation function squashes the input values between 0 and 1. It has an S-shaped curve and is commonly used in binary classification tasks. However, it suffers from the vanishing gradient problem and is not recommended for the hidden layers of deep neural networks; it is appropriate at the very end of a DNN to map the last layer's raw output into a probability score.
- Hyperbolic Tangent (Tanh): The tanh activation function squashes the input values between -1 and 1. Similar to the sigmoid function, it has an S-shaped curve but is centered at 0. Tanh is often used in hidden layers of neural networks.
- Rectified Linear Unit (ReLU): ReLU outputs the input directly if it is positive; otherwise, it outputs zero. It is computationally efficient and helps mitigate the vanishing gradient problem. ReLU is widely used in deep learning models due to its simplicity and effectiveness.
- Leaky ReLU: Similar to ReLU, but it allows a small, non-zero gradient when the input is negative. This helps prevent "dying ReLU" neurons, which can occur when a large gradient update causes a neuron to never activate again.
- Exponential Linear Unit (ELU): Similar to ReLU for positive input values, but it smoothly saturates toward a small negative value for negative inputs. It helps prevent dead neurons and can capture information from negative inputs.
- Softmax: Typically used in the output layer of a neural network for multi-class classification tasks. It converts the raw output scores (logits) into probabilities, ensuring that the probabilities for all classes sum to 1. Softmax is useful for determining the probability distribution over multiple classes.
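The activation functions listed above, written as plain NumPy expressions for reference (the alpha values are the usual defaults, assumed here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()
```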
A layer in a neural network is a collection of neurons that each compute some output value using the entire input. The output of a layer consists of all the output values computed by the neurons within that layer. A neural network is a sequence of layers of neurons in which the output of one layer is the input to the next. The first layer of the neural network is the input layer, and it takes in the training data as input. The last layer of the network is the output layer, and it outputs values that are used as predictions for whatever task the network is being trained to perform. All layers in between are called hidden layers.
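Since the book builds models in both Keras and PyTorch, a layered network like the one described above might be sketched in Keras as follows (the layer sizes and the ten input features are arbitrary assumptions for this sketch):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input layer -> two hidden layers -> output layer
model = keras.Sequential([
    keras.Input(shape=(10,)),              # 10 input features
    layers.Dense(32, activation="relu"),   # hidden layer 1
    layers.Dense(16, activation="relu"),   # hidden layer 2
    layers.Dense(1, activation="sigmoid")  # output layer for binary classification
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```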
This is a thorough investigation into machine learning (ML) applied to anomaly detection. The authors provide a great deal of detail on how to apply particular ML techniques, allowing the diligent reader to follow along to learn the techniques. (I, however, was not that diligent and simply skimmed the details.) This book shines in its detail.
The production quality is less impressive. Apparently the publisher leaves it up to the authors to produce their own graphics, and the authors try, but a contribution from a professional graphic artist would have benefited the final product.
I would have appreciated some attention to more traditional statistical techniques of anomaly detection, to understand how much better the machine learning techniques are. The conceptual discussion of the ML techniques was not always thoroughly enlightening to me.
This book shines as a practical guide to applying a wide array of machine learning techniques to anomaly detection. It is not a complete guide to the topic, but that was not the authors' goal, so this should be seen not as a criticism but as an observation.
This is a reasonably useful book. My complaint is that it has a number of errors that throw off the reader (e.g., errors in examples), making one wonder if one has misunderstood the idea. There are also a number of repeated passages across the chapters, so one finds oneself skipping ahead to avoid them. Many of the images are of very poor quality, for no obvious reason.