Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
In analytics, it’s more important for individuals to be able to formulate problems well, to prototype solutions quickly, to make reasonable assumptions in the face of ill-structured problems, to design experiments that represent good investments, and to analyze results.
The main difference is that data mining focuses on the automated search for knowledge, patterns, or regularities from data.[10] An important skill for a business analyst is to be able to recognize what sort of analytic technique is appropriate for addressing a particular problem.
A query is a specific request for a subset of data or for statistics about data, formulated in a technical language and posed to a database system.
In contrast, data mining could be used to come up with this query in the first place — as a pattern or regularity in the data.
Data warehouses collect and coalesce data from across an enterprise, often from multiple transaction-processing systems, each with its own database.
Here we are less interested in explaining a particular dataset than in extracting patterns that will generalize to other data, for the purpose of improving some business process.
The collection of methods for extracting (predictive) models from data, now known as machine learning methods, was developed in several fields contemporaneously, most notably Machine Learning, Applied Statistics, and Pattern Recognition.
The field of Data Mining (or KDD: Knowledge Discovery and Data Mining) started as an offshoot of Machine Learning, and they remain closely linked. Both fields are concerned with the analysis of data to find useful or informative patterns. Techniques and algorithms are shared between the two; indeed, the areas are so closely related that researchers commonly participate in both communities and transition between them seamlessly.
Following our example of data mining for churn prediction from the first section, we will begin by thinking of predictive modeling as supervised segmentation — how can we segment the population into groups that differ from each other with respect to some quantity of interest. In particular, how can we segment the population with respect to something that we would like to predict or estimate.
What exactly it means to be “informative” varies among applications, but generally, information is a quantity that reduces uncertainty about something.
Having a target variable crystallizes our notion of finding informative attributes: are there one or more other variables that reduce our uncertainty about the value of the target?
One tried-and-true method for analyzing very large datasets is first to select a subset of the data to analyze. Selecting informative attributes provides an “intelligent” method for selecting an informative subset of the data.
a model is a simplified representation of reality created to serve a purpose. It is simplified based on some assumptions about what is and is not important for the specific purpose, or sometimes based on constraints on information or tractability.
In data science, a predictive model is a formula for estimating the unknown value of interest: the target. The formula could be mathematical, or it could be a logical statement such as a rule. Often it is a hybrid of the two.
In data science, prediction more generally means to estimate an unknown value. This value could be something in the future (in common usage, true prediction), but it could also be something in the present or in the past.
Supervised learning is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable.
The creation of models from data is known as model induction. Induction is a term from philosophy that refers to generalizing from specific cases to general rules (or laws, or truths). Our models are general rules in a statistical sense (they usually do not hold 100% of the time; often not nearly), and the procedure that creates the model from the data is called the induction algorithm or learner.
The input data for the induction algorithm, used for inducing the model, are called the training data. As mentioned in Chapter 2, they are called labeled data because the value for the target variable (the label) is known.
An intuitive way of thinking about extracting patterns from data in a supervised manner is to try to segment the population into subgroups that have different values for the target variable (and within the subgroup the instances have similar values for the target variable).
Technically, we would like the resulting groups to be as pure as possible. By pure we mean homogeneous with respect to the target variable. If every member of a group has the same value for the target, then the group is pure. If there is at least one member of the group that has a different value for the target variable than the rest of the group, then the group is impure.
Fortunately, for classification problems we can address all the issues by creating a formula that evaluates how well each attribute splits a set of examples into segments, with respect to a chosen target variable. Such a formula is based on a purity measure.
Entropy is a measure of disorder that can be applied to a set, such as one of our individual segments. Consider that we have a set of properties of members of the set, and each member has one and only one of the properties. In supervised segmentation, the member properties will correspond to the values of the target variable. Disorder corresponds to how mixed (impure) the segment is with respect to these properties of interest.
Note (ElvinOuyang): definition of entropy
We can see then that entropy measures the general disorder of the set, ranging from zero at minimum disorder (the set has members all with the same, single property) to one at maximal disorder (the properties are equally mixed).
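As a concrete illustration, here is a minimal Python sketch of the entropy calculation, entropy = - p1 log2(p1) - p2 log2(p2) - ..., applied to a hypothetical churn segment (the labels and counts are invented for illustration):

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy of a set, given each member's value for the property of interest.
        n = len(labels)
        probabilities = [count / n for count in Counter(labels).values()]
        return sum(-p * math.log2(p) for p in probabilities)

    print(entropy(["churn"] * 10))                # zero: every member has the same property
    print(entropy(["churn"] * 5 + ["stay"] * 5))  # 1.0: maximal disorder for two properties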
Information gain (IG) measures how much an attribute improves (decreases) entropy over the whole segmentation it creates.
Note (ElvinOuyang): information gain definition
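A minimal sketch of that measure, reusing the entropy function sketched above; the parent set and its two child segments are made-up numbers:

    def information_gain(parent, children):
        # IG = entropy(parent) minus the size-weighted average entropy of the children.
        n = len(parent)
        weighted_child_entropy = sum(len(child) / n * entropy(child) for child in children)
        return entropy(parent) - weighted_child_entropy

    parent = ["churn"] * 5 + ["stay"] * 5
    children = [["churn"] * 4 + ["stay"], ["churn"] + ["stay"] * 4]
    print(information_gain(parent, children))  # roughly 0.28 bits of uncertainty removed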
the fundamental idea is important: a natural measure of impurity for numeric values is variance. If the set has all the same values for the numeric target variable, then the set is pure and the variance is zero. If the numeric target values in the set are very different, then the set will have high variance.
To create the best segmentation given a numeric target, we might choose the one that produces the best weighted average variance reduction.
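A sketch of that criterion under the same assumptions, using population variance as the impurity measure; the numeric target values below are invented:

    from statistics import pvariance

    def variance_reduction(parent, children):
        # Parent variance minus the size-weighted average variance of the child segments.
        n = len(parent)
        weighted = sum(len(child) / n * pvariance(child) for child in children)
        return pvariance(parent) - weighted

    parent = [10, 12, 11, 40, 42, 41]
    children = [[10, 12, 11], [40, 42, 41]]
    print(variance_reduction(parent, children))  # large reduction: each child is nearly uniform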
entropy graphs for the mushroom domain (Figure 3-6 through Figure 3-8). Each graph is a two-dimensional description of the entire dataset’s entropy as it is divided in various ways by different attributes. On the x axis is the proportion of the dataset (0 to 1), and on the y axis is the entropy (also 0 to 1) of a given piece of the data. The amount of shaded area in each graph represents the amount of entropy in the dataset when it is divided by some chosen attribute (or not divided, in the case of Figure 3-6). Our goal of having the lowest entropy corresponds to having as little shaded area as possible.
Note (ElvinOuyang): definition and usage of entropy graph
There are many techniques to induce a supervised segmentation from a dataset. One of the most popular is to create a tree-structured model (tree induction). These techniques are popular because tree models are easy to understand, and because the induction procedures are elegant (simple to describe) and easy to use. They are robust to many common data problems and are relatively efficient. Most data mining packages include some type of tree induction technique.
In summary, the procedure of classification tree induction is a recursive process of divide and conquer, where the goal at each step is to select an attribute to partition the current group into subgroups that are as pure as possible with respect to the target variable. We perform this partitioning recursively, splitting further and further until we are done. We choose the attributes to split upon by testing all of them and selecting whichever yields the purest subgroups. When are we done? (In other words, when do we stop recursing?) It should be clear that we would stop when the nodes are pure, or when we run out of variables to split on.
Note (ElvinOuyang): underlying logic for tree induction
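The recursive logic can be sketched in a few lines of Python; this is a simplification that assumes categorical attributes and reuses the information_gain function sketched above, not the book's implementation:

    def induce_tree(examples, target, attributes):
        # examples: list of dicts; target and attributes: keys into those dicts.
        labels = [ex[target] for ex in examples]
        # Stop when the node is pure or no attributes remain; predict the majority value.
        if len(set(labels)) == 1 or not attributes:
            return max(set(labels), key=labels.count)

        # Choose the attribute whose split yields the purest subgroups (highest gain).
        def gain(attribute):
            values = {ex[attribute] for ex in examples}
            children = [[ex[target] for ex in examples if ex[attribute] == v] for v in values]
            return information_gain(labels, children)

        best = max(attributes, key=gain)
        tree = {best: {}}
        for value in {ex[best] for ex in examples}:
            subset = [ex for ex in examples if ex[best] == value]
            tree[best][value] = induce_tree(subset, target, [a for a in attributes if a != best])
        return tree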
The lines separating the regions are known as decision lines (in two dimensions) or more generally decision surfaces or decision boundaries.
for a problem of n variables, each node of a classification tree imposes an (n–1)-dimensional “hyperplane” decision boundary on the instance space.
If we trace down a single path from the root node to a leaf, collecting the conditions as we go, we generate a rule. Each rule consists of the attribute tests along the path connected with AND.
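For example, one root-to-leaf path might read off as a rule like the following; the attributes Balance and Age, the thresholds, and the class name are hypothetical:

    # IF Balance >= 50000 AND Age < 45 THEN Class = write-off
    def rule_matches(example):
        return example["Balance"] >= 50000 and example["Age"] < 45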
This is called a frequency-based estimate of class membership probability.
Instead of simply computing the frequency, we would often use a “smoothed” version of the frequency-based estimate, known as the Laplace correction, the purpose of which is to moderate the influence of leaves with only a few instances. The equation for binary class probability estimation becomes p(c) = (n + 1) / (n + m + 2), where n is the number of examples in the leaf belonging to class c, and m is the number of examples not belonging to class c.
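A small numeric illustration of the difference, using hypothetical leaf counts:

    def frequency_estimate(n, m):
        # Raw frequency-based estimate for a leaf with n members of class c and m others.
        return n / (n + m)

    def laplace_estimate(n, m):
        # Laplace-corrected estimate; pulls estimates from small leaves toward 0.5.
        return (n + 1) / (n + m + 2)

    print(frequency_estimate(2, 0))  # 1.0: overconfident, based on only two instances
    print(laplace_estimate(2, 0))    # 0.75: moderated because the evidence is thin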
However, the order in which features are chosen for the tree doesn’t exactly correspond to their ranking in Figure 3-17. Why is this? The answer is that the table ranks each feature by how good it is independently, evaluated separately on the entire population of instances. Nodes in a classification tree depend on the instances above them in the tree. Therefore, except for the root node, features in a classification tree are not evaluated on the entire set of instances. The information gain of a feature depends on the set of instances against which it is evaluated, so the ranking of features may change once the splits above a node have restricted that set of instances.
basic concepts of predictive modeling, one of the main tasks of data science, in which a model is built that can estimate the value of a target variable for a new unseen example.
An alternative method for learning a predictive model from a dataset is to start by specifying the structure of the model with certain numeric parameters left unspecified. Then the data mining calculates the best parameter values given a particular set of training data.
Note (ElvinOuyang): Alternative thinking on data mining: parameter learning
This general approach is called parameter learning or parametric modeling.
This is called a linear discriminant because it discriminates between the classes, and the function of the decision boundary is a linear combination — a weighted sum — of the attributes. In the two dimensions of our example, the linear combination corresponds to a line. In three dimensions, the decision boundary is a plane, and in higher dimensions it is a hyperplane.
Note (ElvinOuyang): linear discriminant
In Trees as Sets of Rules we showed how a classification tree corresponds to a rule set — a logical classification model of the data. A linear discriminant function is a numeric classification model.
To use this model as a linear discriminant, for a given instance represented by a feature vector x, we check whether f(x) is positive or negative. As discussed above, in the two-dimensional case, this corresponds to seeing whether the instance x falls above or below the line.
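A toy sketch of such a classifier; the two features and the weights are made up, and in practice the weights would be learned from data:

    w0, w1, w2 = -50.0, 1.0, 0.8  # hypothetical intercept and feature weights

    def f(x1, x2):
        # The discriminant is a weighted sum (linear combination) of the attributes.
        return w0 + w1 * x1 + w2 * x2

    def classify(x1, x2):
        # Instances on the positive side of the boundary get class "+", otherwise "-".
        return "+" if f(x1, x2) > 0 else "-"

    print(classify(40, 20))  # f = 6.0, so "+"
    print(classify(30, 10))  # f = -12.0, so "-"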
Roughly, the larger the magnitude of a feature’s weight, the more important that feature is for classifying the target — assuming all feature values have been normalized to the same range, as mentioned in Sidebar: Simplifying Assumptions in This Chapter. By the same token, if a feature’s weight is near zero the corresponding feature can usually be ignored or discarded.
we need to ask, what should be our goal or objective in choosing the parameters? In our case, this would allow us to answer the question: what weights should we choose? Our general procedure will be to define an objective function that represents our goal, and can be calculated for a particular set of weights and a particular set of data. We will then find the optimal value for the weights by maximizing or minimizing the objective function.
Note (ElvinOuyang): definition of objective function
Unfortunately, creating an objective function that matches the true goal of the data mining is usually impossible, so data scientists often choose based on faith[22] and experience.
Logistic regression applies linear models to class probability estimation, which is particularly useful for many applications.
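A minimal sketch with scikit-learn, assuming it is installed; the toy, normalized feature values and churn labels below are invented:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.1], [0.7, 0.2], [0.3, 0.7]])
    y = np.array([1, 1, 0, 0, 0, 1])  # 1 = churn, 0 = stay

    model = LogisticRegression().fit(X, y)
    print(model.predict_proba([[0.5, 0.5]]))  # class probability estimates, not just labels
    print(model.coef_, model.intercept_)      # the weights of the underlying linear model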
Support Vector Machines, Briefly
Note (ElvinOuyang): Support Vector Machines
SVMs choose based on a simple, elegant idea: instead of thinking about separating with a line, first fit the fattest bar between the classes.
Note (ElvinOuyang): definition of SVM
Then once the widest bar is found, the linear discriminant will be the center line through the bar (the solid middle line in Figure 4-8). The distance between the dashed parallel lines is called the margin around the linear discriminant, and thus the objective is to maximize the margin.
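A comparable sketch with scikit-learn's linear-kernel SVM on the same invented data; the margin, i.e. the width of the "bar" around the linear discriminant, works out to 2 divided by the norm of the learned weight vector:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.1], [0.7, 0.2], [0.3, 0.7]])
    y = np.array([1, 1, 0, 0, 0, 1])

    svm = SVC(kernel="linear").fit(X, y)
    w = svm.coef_[0]
    print(2 / np.linalg.norm(w))  # the margin around the linear discriminant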