Kindle Notes & Highlights
Datafication is a new term that means that almost every phenomenon is now being observed and stored.
Kryder’s law predicts that the density and capability of hard drive storage media will double every 18 months.
Here are brief descriptions of some of the most important data mining techniques used to generate insights from data. Decision Trees: They help classify populations into classes. It is said that 70% of all data mining work is about classification solutions, and that 70% of all classification work uses decision trees. Thus, decision trees are the most popular and important data mining technique. There are many popular algorithms for making decision trees. They differ in their mechanisms, and each technique works well in different situations. It is possible to try multiple decision-tree algorithms on the same data and compare their results.
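To make the idea of trying multiple decision-tree algorithms concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the iris dataset, the split proportions, and the splitting criteria compared are assumptions for illustration, not the book's example.

    # Sketch: try two decision-tree configurations on the same data and compare accuracy.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    for criterion in ("gini", "entropy"):          # two different splitting mechanisms
        tree = DecisionTreeClassifier(criterion=criterion, random_state=42)
        tree.fit(X_train, y_train)
        print(criterion, "test accuracy:", round(tree.score(X_test, y_test), 3))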
The total amount of data in the world is doubling every 18 months.
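As a quick illustrative calculation of what doubling every 18 months implies (the starting amount is arbitrary; this is back-of-the-envelope arithmetic, not a figure from the book):

    # Doubling every 18 months means growth by a factor of 2**(months / 18).
    start = 1.0                                    # today's amount, in arbitrary units
    for years in (3, 6, 9):
        factor = 2 ** (years * 12 / 18)
        print(years, "years ->", round(start * factor, 1), "times the original amount")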
A decision tree can be mapped to business rules. If the objective function is prediction, then a decision tree or business rules are the most appropriate mode of representing the output.
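As an illustration of mapping a tree to business rules, scikit-learn's export_text prints a fitted tree as nested if/then conditions; the dataset and depth limit below are assumptions.

    # Sketch: print a fitted decision tree as readable if/then business rules.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)
    print(export_text(tree, feature_names=list(data.feature_names)))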
The output can be in the form of a regression equation or a mathematical function that represents the best-fitting curve for the data.
Population “centroid” is a statistical measure for describing central tendencies of a collection of data points.
Business rules are an appropriate representation of the output of a market basket analysis exercise.
There are two primary kinds of data mining processes: supervised learning and unsupervised learning. In supervised learning, a decision model can be created using past data, and the model can then be used to predict the correct answer for future data instances.
Predictive models with more than 70% accuracy can be considered usable in business domains, depending upon the nature of the business.
Figure 4.2: Important Data Mining Techniques
- Supervised Learning (predictive ability based on past data): Classification via machine learning (decision trees, neural networks); classification via statistics (regression)
- Unsupervised Learning (exploratory analysis to discover patterns): Clustering analysis; association rules
Classification techniques are called supervised learning as there is a way to supervise whether the model is providing the right or wrong answers.
Decision trees are the most popular data mining technique, for many reasons. Decision trees are easy to understand and easy to use, by analysts as well as executives. They also show a high predictive accuracy. Decision trees select the most relevant variables automatically out of all the available variables for decision making. Decision trees are tolerant of data quality issues and do not require much data preparation from the users. Even non-linear relationships can be handled well by decision trees.
Regression is one of the most popular statistical data mining techniques. The goal of regression is to derive a smooth, well-defined curve that best fits the data.
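For instance, a smooth best-fitting curve can be derived with a least-squares polynomial fit; this NumPy sketch uses made-up data points.

    # Sketch: fit a smooth second-degree curve to a handful of made-up points.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 4.3, 8.9, 16.2, 24.8, 36.1])   # roughly quadratic, with noise

    coeffs = np.polyfit(x, y, deg=2)                  # least-squares fit of a*x^2 + b*x + c
    curve = np.poly1d(coeffs)
    print("fitted curve:", curve)
    print("prediction at x = 7:", round(curve(7.0), 1))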
Artificial Neural Networks (ANNs) are a sophisticated data mining technique from the Artificial Intelligence stream in Computer Science. They mimic the behavior of the human neural structure: neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. A decision task may be processed by just one neuron and the result may be communicated soon. Alternatively, there could be many layers of neurons involved in a decision task, depending upon the complexity of the domain. The neural network can be trained by making a decision on a training instance, comparing it with the known correct answer, and adjusting the weights accordingly.
At some point, the neural network will have learned enough and will begin to match the predictive accuracy of a human expert or alternative classification techniques. The predictions of some ANNs that have been trained over a long period of time with a large amount of data have become decisively more accurate than those of human experts. At that point, the ANNs can begin to be seriously considered for deployment in real situations in real time. ANNs are popular because they are eventually able to reach a high predictive accuracy. ANNs are also relatively simple to implement and do not have any issues with data quality.
Cluster Analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is a technique used for automatic identification of natural groupings of things in the data.
Clustering is also a part of the artificial intelligence family of techniques.
Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, it helps in answering questions about cross-selling opportunities. This is the heart of the personalization engine used by ecommerce sites like Amazon.com and streaming movie sites like Netflix.com. The technique helps find interesting relationships (affinities) between variables (items or events). These are represented as rules of the form X → Y, where X and Y are sets of data items. It is a form of unsupervised learning, and it has no dependent variable.
An important element is to go after the problem iteratively. It is better to divide and conquer the problem with smaller amounts of data, and get closer to the heart of the solution in an iterative sequence of steps.
The more data is available for training the decision tree, the more accurate its knowledge extraction will be, and thus it will make more accurate decisions. The more variables the tree can choose from, the greater the likely accuracy of the decision tree. In addition, a good decision tree should also be frugal, so that it takes the least number of questions, and thus the least amount of effort, to get to the right decision.
A decision tree is a hierarchically branched structure. What should be the first question asked in creating the tree? One should ask the more important question first, and the less important questions later.
It may be possible to increase predictive accuracy by making more sub-trees and making the tree longer. However, the marginal accuracy gained from each subsequent level in the tree will be less, and may not be worth the loss in ease and interpretability of the tree. If the branches are long and complicated, it will be difficult to understand and use. The longer branches may need to be trimmed to keep the tree easy to use.
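One practical way to keep a tree frugal is to cap its depth and check how much accuracy each extra level actually buys; in this scikit-learn sketch the dataset and the depths compared are assumptions.

    # Sketch: the marginal accuracy gained from deeper trees often shrinks quickly.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

    for depth in (2, 3, 5, 10, None):                 # None lets the tree grow fully
        tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
        print("max_depth =", depth, "-> test accuracy:", round(tree.score(X_test, y_test), 3))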
Decision trees are the most popular, versatile, and easy to use data mining technique with high predictive accuracy. They are also very useful as communication tools with executives. There are many successful decision tree algorithms. All publicly available data mining software platforms offer multiple decision tree implementations.
Regression is a well-known statistical technique to model the predictive relationship between several independent variables and one dependent variable. The objective is to find the best-fitting curve for a dependent variable in a multidimensional space, with each independent variable being a dimension.
Regression models are simple, versatile, visual/graphical tools with high predictive ability. They include non-linear as well as binary predictions.
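A minimal sketch of a regression with several independent variables and one dependent variable (the two-predictor data below is made up):

    # Sketch: multiple linear regression with two independent variables.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1, 10], [2, 12], [3, 11], [4, 15], [5, 14], [6, 18]])   # independent variables
    y = np.array([12.0, 15.1, 16.8, 21.2, 22.9, 27.5])                     # dependent variable

    model = LinearRegression().fit(X, y)
    print("coefficients:", model.coef_, "intercept:", round(model.intercept_, 2))
    print("R^2 on the training data:", round(model.score(X, y), 3))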
Artificial Neural Networks (ANN) are inspired by the information processing model of the mind/brain. The human brain consists of billions of neurons that link with one another in an intricate pattern. Every neuron receives information from many other neurons, processes it, gets excited or not, and passes its state information to other neurons.
ANNs are composed of a large number of highly interconnected processing elements (neurons) working in a multi-layered structure that receives inputs, processes the inputs, and produces an output. An ANN is designed for a specific application, such as pattern recognition or data classification, and trained through a learning process. Just like in biological systems, ANNs make adjustments to the synaptic connections with each learning instance. ANNs are like a black box trained to solve a particular type of problem, and they can develop high predictive powers. Their intermediate synaptic weights, however, are not easy to interpret.
A neuron is the basic processing unit of the network. The neuron (or processing element, PE) receives inputs from its preceding neurons, does some nonlinear weighted computation on the basis of those inputs, transforms the result into its output value, and then passes the output on to the next neuron in the network (Figure 8.2: Model for a single artificial neuron). The x's are the inputs, the w's are the weights for each input, and y is the output. A neural network is a multi-layered model: there is at least one input neuron, one output neuron, and at least one processing neuron.
A neural network is a series of neurons that receive inputs from other neurons. Each neuron computes a weighted sum of all its inputs, using different weights (or importance levels) for each input. The weighted sum is then transformed into an output value using a transfer function. Learning in an ANN occurs when the various processing elements in the neural network adjust the underlying relationship (weights, transfer function, etc.) between inputs and outputs, in response to the feedback on their predictions. If the prediction made was correct, the weights remain the same; if the prediction was incorrect, the weights are adjusted.
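A bare-bones sketch of a single neuron of this kind: weighted sum, sigmoid transfer function, and a simplified weight adjustment in response to the prediction error. The inputs, initial weights, target, and learning rate are all illustrative assumptions.

    # Sketch: one neuron = weighted sum of inputs -> transfer function -> output,
    # with weights nudged whenever the prediction is off.
    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    x = [0.5, 0.8, 0.2]              # inputs from preceding neurons
    w = [0.1, -0.4, 0.3]             # one weight per input
    bias, target, learning_rate = 0.05, 1.0, 0.5

    for step in range(3):
        weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + bias
        output = sigmoid(weighted_sum)
        error = target - output                       # feedback on the prediction
        w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
        bias += learning_rate * error
        print("step", step, "output:", round(output, 3), "error:", round(error, 3))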
There are many ways to architect the functioning of an ANN, using fairly simple and open rules, with a tremendous amount of flexibility at each stage. The most popular architecture is a feed-forward, multi-layered perceptron with a back-propagation learning algorithm. That means there are multiple layers of PEs in the system, the outputs of neurons are fed forward to the PEs in the next layers, and the feedback on the prediction is fed back into the neural network for learning to occur. This is essentially what was described in the earlier paragraphs.
The steps required to build an ANN are as follows:
1. Gather data, and divide it into training data and test data. The training data needs to be further divided into training data and validation data.
2. Select the network architecture, such as a feed-forward network.
3. Select the algorithm, such as a multi-layer perceptron.
4. Set the network parameters.
5. Train the ANN with the training data.
6. Validate the model with the validation data. Freeze the weights and other parameters.
7. Test the trained network with the test data.
8. Deploy the ANN when it achieves good predictive accuracy.
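A compressed sketch of these steps using scikit-learn's MLPClassifier, a feed-forward multi-layer perceptron trained with back-propagation; the digits dataset, layer size, and split proportions are assumptions.

    # Sketch: gather data, split it, pick an architecture, set parameters, train, validate, test.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=7)

    ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=7)   # network parameters
    ann.fit(X_train, y_train)                                                     # train
    print("validation accuracy:", round(ann.score(X_val, y_val), 3))              # validate
    print("test accuracy:", round(ann.score(X_test, y_test), 3))                  # test before deploying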
There are many benefits of using ANNs. ANNs impose very few restrictions on their use. They can deal with (identify and model) highly nonlinear relationships on their own, without much work from the user or analyst. They help find practical data-driven solutions where algorithmic solutions are non-existent or too complicated. There is no need to program neural networks, as they learn from examples. They get better with use, without much programming effort. They can handle a variety of problem types, including classification, clustering, associations, etc. ANNs are also tolerant of data quality issues.
They are deemed to be black-box solutions, lacking explainability. Thus they are difficult to communicate about, except through the strength of their results. Optimal design of ANN is still an art: it requires expertise and extensive experimentation. It can be difficult to handle a large number of variables (especially the rich nominal attributes). It takes large data sets to train an ANN.
K-means is the most popular clustering algorithm. It iteratively computes the clusters and their centroids. It is a top-down approach to clustering.
Cluster analysis is a useful, unsupervised learning technique that is used in many business situations to segment the data into meaningful small groups. The K-means algorithm is an easy statistical technique for iteratively segmenting the data. However, selecting the right number of clusters can only be done heuristically.
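A minimal K-means sketch with scikit-learn; the nine points and the choice of k = 3 are assumptions, and choosing k is exactly the heuristic step noted above.

    # Sketch: K-means iteratively assigns points to clusters and recomputes the centroids.
    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1, 2], [1, 4], [1, 0],
                       [10, 2], [10, 4], [10, 0],
                       [5, 20], [6, 19], [5, 21]])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
    print("cluster labels:", kmeans.labels_)
    print("centroids:\n", kmeans.cluster_centers_)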
Association rule mining is a popular, unsupervised learning technique, used in business to help identify shopping patterns. It is also known as market basket analysis. It helps find interesting relationships (affinities) between variables (items or events). Thus, it can help cross-sell related items and increase the size of a sale.
Apriori Algorithm: This is the most popular algorithm used for association rule mining. The objective is to find subsets that are common to at least a minimum number of the itemsets. A frequent itemset is an itemset whose support is greater than or equal to a minimum support threshold.
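The core notion of support can be sketched in plain Python: count how often item pairs co-occur and keep the ones above a minimum support threshold. The baskets and the threshold are made up, and a real Apriori implementation prunes candidate itemsets level by level rather than enumerating pairs directly.

    # Sketch: keep item pairs whose support meets a minimum support threshold.
    from itertools import combinations
    from collections import Counter

    transactions = [
        {"milk", "bread", "butter"},
        {"milk", "bread"},
        {"bread", "butter"},
        {"milk", "bread", "butter"},
        {"milk", "butter"},
    ]
    min_support = 0.6                                  # must appear in >= 60% of baskets

    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    n = len(transactions)
    frequent = {pair: count / n for pair, count in pair_counts.items() if count / n >= min_support}
    print(frequent)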
The first level of analysis is identifying frequent words. This creates a bag of important words.
The next level is identifying meaningful phrases from words.
The next higher level is that of topics. Multiple phrases could be combined into a topic area.
Text mining is a semi-automated process. Text data needs to be gathered, structured, and then mined, in a 3-step process.
Text Mining is a form of data mining. There are many common elements between Text and Data Mining. However, there are some key differences (Table 11.2). The key difference is that text mining requires the conversion of text data into frequency data before data mining techniques can be applied.
Text Mining is diving into the unstructured text to discover valuable insights about the business. The text is gathered and then structured into a term-document matrix based on the frequency of a bag of words in a corpus of documents. The TDM can then be mined for useful, novel patterns, and insights. While the technique is important, the business objective should be well understood and should always be kept in mind.
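A small sketch of structuring a corpus into a term-document matrix with scikit-learn's CountVectorizer; the three documents are made up.

    # Sketch: turn a small corpus into a term-document matrix of word frequencies.
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "data mining finds patterns in data",
        "text mining structures unstructured text",
        "mining text data creates business insights",
    ]
    vectorizer = CountVectorizer(stop_words="english")
    tdm = vectorizer.fit_transform(corpus)             # rows = documents, columns = terms

    print(vectorizer.get_feature_names_out())
    print(tdm.toarray())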
The Naïve Bayes (NB) technique is a supervised learning technique that uses probability-theory-based analysis. It is a machine-learning technique that computes the probabilities of an instance belonging to each one of many target classes, given the prior probabilities of classification using individual factors. The Naïve Bayes technique is often used in classifying text documents into one of multiple predefined categories.
Naïve Bayes is a probability-based machine learning technique used for classification. It is a mathematically simple way to include the contributions of many factors in predicting the class of the next data instance. It is often used to classify texts.
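A minimal sketch of Naïve Bayes text classification using word counts and MultinomialNB; the toy documents, labels, and test sentence are assumptions.

    # Sketch: classify short texts into predefined categories with Naive Bayes.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["cheap meds buy now", "limited offer buy cheap",
            "meeting agenda for monday", "quarterly report attached"]
    labels = ["spam", "spam", "not spam", "not spam"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    nb = MultinomialNB().fit(X, labels)
    print(nb.predict(vectorizer.transform(["buy cheap meds"])))   # expected: ['spam']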
Support Vector Machine (SVM) is a mathematically rigorous machine learning technique for building a linear binary classifier. It creates a hyperplane in a high-dimensional space that can accurately slice a dataset into two segments according to the desired objective. The algorithms for developing the classifier can be mathematically challenging, though. SVMs are popular since they are state-of-the-art for many practical problems, such as identifying spam emails and other text mining applications.
An SVM is a classifier function in a high-dimensional space that defines the decision boundary between two classes. The support vectors are the data points that define the ‘gutters’, or the boundary condition, on either side of the hyperplane, for each of the two classes. The SVM model is thus conceptually easy to understand.
The heart of an SVM algorithm is the kernel method. Most kernel algorithms are based on optimization in a convex space and are statistically well-founded. Kernel stands for the core, or the germ in a fruit. Kernel methods operate using what is called the 'kernel trick'. This trick involves computing and working with the inner products of only the relevant pairs of data in the feature space; there is no need to compute the coordinates of all the data in a high-dimensional feature space. The kernel trick makes the algorithm much less demanding in computational and memory resources.
Kernel methods achieve this by learning from instances. They do not apply some standard computational logic to all the features of each input. Instead, they remember each training example and associate with it a weight representing its relevance to achieving the objective. This could be called instance-based learning. There are several types of support vector models, based on different kernels, including linear, polynomial, RBF, and sigmoid.
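A short sketch of binary SVM classifiers with different kernels using scikit-learn's SVC; the dataset, the scaling step, and the kernels compared are assumptions.

    # Sketch: train binary SVM classifiers with different kernel functions and compare accuracy.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

    for kernel in ("linear", "poly", "rbf", "sigmoid"):
        clf = make_pipeline(StandardScaler(), SVC(kernel=kernel)).fit(X_train, y_train)
        print(kernel, "-> test accuracy:", round(clf.score(X_test, y_test), 3))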

