Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
In analytics, it’s more important for individuals to be able to formulate problems well, to prototype solutions quickly, to make reasonable assumptions in the face of ill-structured problems, to design experiments that represent good investments, and to analyze results.
The main difference is that data mining focuses on the automated search for knowledge, patterns, or regularities from data.[10] An important skill for a business analyst is to be able to recognize what sort of analytic technique is appropriate for addressing a particular problem.
A query is a specific request for a subset of data or for statistics about data, formulated in a technical language and posed to a database system.
In contrast, data mining could be used to come up with this query in the first place — as a pattern or regularity in the data.
Data warehouses collect and coalesce data from across an enterprise, often from multiple transaction-processing systems, each with its own database.
Here we are less interested in explaining a particular dataset than in extracting patterns that will generalize to other data, for the purpose of improving some business process.
The collection of methods for extracting (predictive) models from data, now known as machine learning methods, was developed in several fields contemporaneously, most notably Machine Learning, Applied Statistics, and Pattern Recognition.
The field of Data Mining (or KDD: Knowledge Discovery and Data Mining) started as an offshoot of Machine Learning, and they remain closely linked. Both fields are concerned with the analysis of data to find useful or informative patterns. Techniques and algorithms are shared between the two; indeed, the areas are so closely related that researchers commonly participate in both communities and transition between them seamlessly.
Following our example of data mining for churn prediction from the first section, we will begin by thinking of predictive modeling as supervised segmentation — how can we segment the population into groups that differ from each other with respect to some quantity of interest. In particular, how can we segment the population with respect to something that we would like to predict or estimate.
What exactly it means to be “informative” varies among applications, but generally, information is a quantity that reduces uncertainty about something.
Having a target variable crystallizes our notion of finding informative attributes: are there one or more other variables that reduce our uncertainty about the value of the target?
One tried-and-true method for analyzing very large datasets is first to select a subset of the data to analyze. Selecting informative attributes provides an “intelligent” method for selecting an informative subset of the data.
a model is a simplified representation of reality created to serve a purpose. It is simplified based on some assumptions about what is and is not important for the specific purpose, or sometimes based on constraints on information or tractability.
In data science, a predictive model is a formula for estimating the unknown value of interest: the target. The formula could be mathematical, or it could be a logical statement such as a rule. Often it is a hybrid of the two.
In data science, prediction more generally means to estimate an unknown value. This value could be something in the future (in common usage, true prediction), but it could also be something in the present or in the past.
Supervised learning is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable.
The creation of models from data is known as model induction. Induction is a term from philosophy that refers to generalizing from specific cases to general rules (or laws, or truths). Our models are general rules in a statistical sense (they usually do not hold 100% of the time; often not nearly), and the procedure that creates the model from the data is called the induction algorithm or learner.
The input data for the induction algorithm, used for inducing the model, are called the training data. As mentioned in Chapter 2, they are called labeled data because the value for the target variable (the label) is known.
An intuitive way of thinking about extracting patterns from data in a supervised manner is to try to segment the population into subgroups that have different values for the target variable (and within the subgroup the instances have similar values for the target variable).
Technically, we would like the resulting groups to be as pure as possible. By pure we mean homogeneous with respect to the target variable. If every member of a group has the same value for the target, then the group is pure. If there is at least one member of the group that has a different value for the target variable than the rest of the group, then the group is impure.
Fortunately, for classification problems we can address all the issues by creating a formula that evaluates how well each attribute splits a set of examples into segments, with respect to a chosen target variable. Such a formula is based on a purity measure.
Entropy is a measure of disorder that can be applied to a set, such as one of our individual segments. Consider that we have a set of properties of members of the set, and each member has one and only one of the properties. In supervised segmentation, the member properties will correspond to the values of the target variable. Disorder corresponds to how mixed (impure) the segment is with respect to these properties of interest.
Note (ElvinOuyang): definition of entropy
We can see then that entropy measures the general disorder of the set, ranging from zero at minimum disorder (the set has members all with the same, single property) to one at maximal disorder (the properties are equally mixed).
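As a concrete illustration, here is a minimal Python sketch of the entropy calculation, entropy = - p1 log2(p1) - p2 log2(p2) - ..., applied to a hypothetical churn segment (the labels and counts are invented for illustration):

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy of a set, given each member's value for the property of interest.
        n = len(labels)
        probabilities = [count / n for count in Counter(labels).values()]
        return sum(-p * math.log2(p) for p in probabilities)

    print(entropy(["churn"] * 10))                # zero: every member has the same property
    print(entropy(["churn"] * 5 + ["stay"] * 5))  # 1.0: maximal disorder for two properties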
Information gain (IG) measures how much an attribute improves (decreases) entropy over the whole segmentation it creates.
Note (ElvinOuyang): information gain definition
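A minimal sketch of that measure, reusing the entropy function sketched above; the parent set and its two child segments are made-up numbers:

    def information_gain(parent, children):
        # IG = entropy(parent) minus the size-weighted average entropy of the children.
        n = len(parent)
        weighted_child_entropy = sum(len(child) / n * entropy(child) for child in children)
        return entropy(parent) - weighted_child_entropy

    parent = ["churn"] * 5 + ["stay"] * 5
    children = [["churn"] * 4 + ["stay"], ["churn"] + ["stay"] * 4]
    print(information_gain(parent, children))  # roughly 0.28 bits of uncertainty removed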
the fundamental idea is important: a natural measure of impurity for numeric values is variance. If the set has all the same values for the numeric target variable, then the set is pure and the variance is zero. If the numeric target values in the set are very different, then the set will have high variance.
To create the best segmentation given a numeric target, we might choose the one that produces the best weighted average variance reduction.
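A sketch of that criterion under the same assumptions, using population variance as the impurity measure; the numeric target values below are invented:

    from statistics import pvariance

    def variance_reduction(parent, children):
        # Parent variance minus the size-weighted average variance of the child segments.
        n = len(parent)
        weighted = sum(len(child) / n * pvariance(child) for child in children)
        return pvariance(parent) - weighted

    parent = [10, 12, 11, 40, 42, 41]
    children = [[10, 12, 11], [40, 42, 41]]
    print(variance_reduction(parent, children))  # large reduction: each child is nearly uniform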
entropy graphs for the mushroom domain (Figure 3-6 through Figure 3-8). Each graph is a two-dimensional description of the entire dataset’s entropy as it is divided in various ways by different attributes. On the x axis is the proportion of the dataset (0 to 1), and on the y axis is the entropy (also 0 to 1) of a given piece of the data. The amount of shaded area in each graph represents the amount of entropy in the dataset when it is divided by some chosen attribute (or not divided, in the case of Figure 3-6). Our goal of having the lowest entropy corresponds to having as little shaded area as possible.
Note (ElvinOuyang): definition and usage of entropy graph
There are many techniques to induce a supervised segmentation from a dataset. One of the most popular is to create a tree-structured model (tree induction). These techniques are popular because tree models are easy to understand, and because the induction procedures are elegant (simple to describe) and easy to use. They are robust to many common data problems and are relatively efficient. Most data mining packages include some type of tree induction technique.
In summary, the procedure of classification tree induction is a recursive process of divide and conquer, where the goal at each step is to select an attribute to partition the current group into subgroups that are as pure as possible with respect to the target variable. We perform this partitioning recursively, splitting further and further until we are done. We choose the attributes to split upon by testing all of them and selecting whichever yields the purest subgroups. When are we done? (In other words, when do we stop recursing?) It should be clear that we would stop when the nodes are pure, or when we run out of variables to split on.
Note (ElvinOuyang): underlying logic for tree induction
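The recursive logic can be sketched in a few lines of Python; this is a simplification that assumes categorical attributes and reuses the information_gain function sketched above, not the book's implementation:

    def induce_tree(examples, target, attributes):
        # examples: list of dicts; target and attributes: keys into those dicts.
        labels = [ex[target] for ex in examples]
        # Stop when the node is pure or no attributes remain; predict the majority value.
        if len(set(labels)) == 1 or not attributes:
            return max(set(labels), key=labels.count)

        # Choose the attribute whose split yields the purest subgroups (highest gain).
        def gain(attribute):
            values = {ex[attribute] for ex in examples}
            children = [[ex[target] for ex in examples if ex[attribute] == v] for v in values]
            return information_gain(labels, children)

        best = max(attributes, key=gain)
        tree = {best: {}}
        for value in {ex[best] for ex in examples}:
            subset = [ex for ex in examples if ex[best] == value]
            tree[best][value] = induce_tree(subset, target, [a for a in attributes if a != best])
        return tree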
The lines separating the regions are known as decision lines (in two dimensions) or more generally decision surfaces or decision boundaries.
for a problem of n variables, each node of a classification tree imposes an (n–1)-dimensional “hyperplane” decision boundary on the instance space.
If we trace down a single path from the root node to a leaf, collecting the conditions as we go, we generate a rule. Each rule consists of the attribute tests along the path connected with AND.
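For example, one root-to-leaf path might read off as a rule like the following; the attributes Balance and Age, the thresholds, and the class name are hypothetical:

    # IF Balance >= 50000 AND Age < 45 THEN Class = write-off
    def rule_matches(example):
        return example["Balance"] >= 50000 and example["Age"] < 45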
This is called a frequency-based estimate of class membership probability.
Instead of simply computing the frequency, we would often use a “smoothed” version of the frequency-based estimate, known as the Laplace correction, the purpose of which is to moderate the influence of leaves with only a few instances. The equation for binary class probability estimation becomes p(c) = (n + 1) / (n + m + 2), where n is the number of examples in the leaf belonging to class c, and m is the number of examples not belonging to class c.
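A small numeric illustration of the difference, using hypothetical leaf counts:

    def frequency_estimate(n, m):
        # Raw frequency-based estimate for a leaf with n members of class c and m others.
        return n / (n + m)

    def laplace_estimate(n, m):
        # Laplace-corrected estimate; pulls estimates from small leaves toward 0.5.
        return (n + 1) / (n + m + 2)

    print(frequency_estimate(2, 0))  # 1.0: overconfident, based on only two instances
    print(laplace_estimate(2, 0))    # 0.75: moderated because the evidence is thin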
However, the order in which features are chosen for the tree doesn’t exactly correspond to their ranking in Figure 3-17. Why is this? The answer is that the table ranks each feature by how good it is independently, evaluated separately on the entire population of instances. Nodes in a classification tree depend on the instances above them in the tree. Therefore, except for the root node, features in a classification tree are not evaluated on the entire set of instances. The information gain of a feature depends on the set of instances against which it is evaluated, so the ranking of features may change once the splits above a node have restricted that set of instances.
basic concepts of predictive modeling, one of the main tasks of data science, in which a model is built that can estimate the value of a target variable for a new unseen example.
An alternative method for learning a predictive model from a dataset is to start by specifying the structure of the model with certain numeric parameters left unspecified. Then the data mining calculates the best parameter values given a particular set of training data.
Note (ElvinOuyang): Alternative thinking on data mining: parameter learning
This general approach is called parameter learning or parametric modeling.
This is called a linear discriminant because it discriminates between the classes, and the function of the decision boundary is a linear combination — a weighted sum — of the attributes. In the two dimensions of our example, the linear combination corresponds to a line. In three dimensions, the decision boundary is a plane, and in higher dimensions it is a hyperplane.
Note (ElvinOuyang): linear discriminant
In Trees as Sets of Rules we showed how a classification tree corresponds to a rule set — a logical classification model of the data. A linear discriminant function is a numeric classification model.
To use this model as a linear discriminant, for a given instance represented by a feature vector x, we check whether f(x) is positive or negative. As discussed above, in the two-dimensional case, this corresponds to seeing whether the instance x falls above or below the line.
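A toy sketch of such a classifier; the two features and the weights are made up, and in practice the weights would be learned from data:

    w0, w1, w2 = -50.0, 1.0, 0.8  # hypothetical intercept and feature weights

    def f(x1, x2):
        # The discriminant is a weighted sum (linear combination) of the attributes.
        return w0 + w1 * x1 + w2 * x2

    def classify(x1, x2):
        # Instances on the positive side of the boundary get class "+", otherwise "-".
        return "+" if f(x1, x2) > 0 else "-"

    print(classify(40, 20))  # f = 6.0, so "+"
    print(classify(30, 10))  # f = -12.0, so "-"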
Roughly, the larger the magnitude of a feature’s weight, the more important that feature is for classifying the target — assuming all feature values have been normalized to the same range, as mentioned in Sidebar: Simplifying Assumptions in This Chapter. By the same token, if a feature’s weight is near zero the corresponding feature can usually be ignored or discarded.
we need to ask, what should be our goal or objective in choosing the parameters? In our case, this would allow us to answer the question: what weights should we choose? Our general procedure will be to define an objective function that represents our goal, and can be calculated for a particular set of weights and a particular set of data. We will then find the optimal value for the weights by maximizing or minimizing the objective function.
Note (ElvinOuyang): definition of objective function
Unfortunately, creating an objective function that matches the true goal of the data mining is usually impossible, so data scientists often choose based on faith[22] and experience.
Logistic regression applies linear models to class probability estimation, which is particularly useful for many applications.
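A minimal sketch with scikit-learn, assuming it is installed; the toy, normalized feature values and churn labels below are invented:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.1], [0.7, 0.2], [0.3, 0.7]])
    y = np.array([1, 1, 0, 0, 0, 1])  # 1 = churn, 0 = stay

    model = LogisticRegression().fit(X, y)
    print(model.predict_proba([[0.5, 0.5]]))  # class probability estimates, not just labels
    print(model.coef_, model.intercept_)      # the weights of the underlying linear model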
Support Vector Machines, Briefly
Note (ElvinOuyang): Support Vector Machines
SVMs choose based on a simple, elegant idea: instead of thinking about separating with a line, first fit the fattest bar between the classes.
Note (ElvinOuyang): definition of SVM
Then once the widest bar is found, the linear discriminant will be the center line through the bar (the solid middle line in Figure 4-8). The distance between the dashed parallel lines is called the margin around the linear discriminant, and thus the objective is to maximize the margin.
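A comparable sketch with scikit-learn's linear-kernel SVM on the same invented data; the margin, i.e. the width of the "bar" around the linear discriminant, works out to 2 divided by the norm of the learned weight vector:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.1], [0.7, 0.2], [0.3, 0.7]])
    y = np.array([1, 1, 0, 0, 0, 1])

    svm = SVC(kernel="linear").fit(X, y)
    w = svm.coef_[0]
    print(2 / np.linalg.norm(w))  # the margin around the linear discriminant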