Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
16%
Many odors are completely characteristic of poisonous or edible mushrooms, so odor is a very informative attribute to check when considering mushroom edibility.[17] If you’re going to build a model to determine mushroom edibility using only a single feature, you should choose its odor.
Paweł Cisło
Mushroom odor
16%
If we select the single variable that gives the most information gain, we create a very simple segmentation. If we select multiple attributes each giving some information gain, it’s not clear how to put them together.
Paweł Cisło
Selecting single vs multiple attributes
16%
Since we are talking about classification, here each leaf contains a classification for its segment. Such a tree is called a classification tree or, more loosely, a decision tree.
Paweł Cisło
Classification/decision tree
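As a quick illustration (not from the book), a minimal scikit-learn sketch of a classification tree on made-up data; the feature names and labels are hypothetical.

```python
# Minimal sketch (toy data, hypothetical feature names): a classification tree
# whose leaves each hold a classification for their segment.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 3], [0, 5], [1, 4], [1, 2], [2, 5], [2, 1]]   # [odor_score, cap_size]
y = [1, 1, 0, 0, 0, 0]                                  # 1 = edible, 0 = poisonous

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["odor_score", "cap_size"]))
```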
16%
leaves of the regression tree contain numeric values.
17%
common form of instance space visualization is a scatterplot on some pair of features, used to compare one variable against another to detect correlations and relationships.
Paweł Cisło
Scatterplot
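A minimal matplotlib sketch of such a scatterplot, on made-up data:

```python
# Minimal sketch (made-up data): plotting one feature against another to eyeball
# correlations and relationships in the instance space.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, 100)
balance = 500 + 40 * age + rng.normal(0, 300, 100)   # loosely correlated toy features

plt.scatter(age, balance, alpha=0.6)
plt.xlabel("age")
plt.ylabel("balance")
plt.show()
```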
17%
if a leaf contains n positive instances and m negative instances, the probability of any new instance being positive may be estimated as n/(n+m). This is called a frequency-based estimate of class membership probability.
Paweł Cisło
Frequency-based estimate of class membership probability
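A minimal sketch of the frequency-based estimate, with made-up leaf counts:

```python
# Frequency-based estimate of class-membership probability for a leaf holding
# n positive and m negative training instances.
def frequency_estimate(n, m):
    return n / (n + m)

print(frequency_estimate(9, 1))   # 0.9
print(frequency_estimate(1, 0))   # 1.0 -- a single-instance leaf gives an extreme estimate
```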
18%
“smoothed” version of the frequency-based estimate, known as the Laplace correction, the purpose of which is to moderate the influence of leaves with only a few instances.
Paweł Cisło
Laplace correction - "smoothed" version of the frequency-based estimate
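A minimal sketch of the binary-class Laplace correction, assuming the usual form p = (n + 1) / (n + m + 2):

```python
# Laplace-corrected ("smoothed") estimate; small leaves are pulled toward 0.5.
def laplace_estimate(n, m):
    return (n + 1) / (n + m + 2)

print(laplace_estimate(1, 0))    # ~0.67 instead of the frequency-based 1.0
print(laplace_estimate(20, 0))   # ~0.95 -- with many instances the correction barely matters
```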
18%
The information gain of a feature depends on the set of instances against which it is evaluated, so the ranking of features for some internal node may not be the same as the global ranking.
Paweł Cisło
Dependence of information gain on the instance set
18%
basic measure of attribute information is called information gain, which is based on a purity measure called entropy; another is variance reduction.
Paweł Cisło
Basic measures of attribute information: information gain + variance reduction
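A minimal sketch (not from the book) of entropy and information gain for a binary split; the counts are made up, and for numeric targets variance reduction plays the analogous role:

```python
# Entropy of a set and information gain of a split, for binary classes.
import math

def entropy(p_pos):
    if p_pos in (0.0, 1.0):
        return 0.0
    return -(p_pos * math.log2(p_pos) + (1 - p_pos) * math.log2(1 - p_pos))

def information_gain(parent, children):
    # parent and children are (n_pos, n_neg) counts; gain = entropy of the parent
    # minus the size-weighted average entropy of the children.
    def ent(counts):
        n_pos, n_neg = counts
        return entropy(n_pos / (n_pos + n_neg))
    total = sum(p + n for p, n in children)
    weighted = sum((p + n) / total * ent((p, n)) for p, n in children)
    return ent(parent) - weighted

# Splitting a 10-positive / 10-negative parent into (9, 1) and (1, 9) children
print(information_gain((10, 10), [(9, 1), (1, 9)]))   # about 0.53 bits
```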
19%
The goal of data mining is to tune the parameters so that the model fits the data as well as possible. This general approach is called parameter learning or parametric modeling.
Paweł Cisło
Goal of data mining
19%
In certain fields of statistics and econometrics, the bare model with unspecified parameters is called “the model.”
Paweł Cisło
"The model" - bare model with no parameters
20%
This is called a linear discriminant because it discriminates between the classes, and the function of the decision boundary is a linear combination — a weighted sum — of the attributes.
Paweł Cisło
Linear discriminant
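A minimal sketch with made-up weights: the discriminant is a weighted sum of the attributes, and the sign of that sum decides which side of the boundary an instance falls on.

```python
# Linear discriminant: classify by the sign of a weighted sum of the attributes.
import numpy as np

w = np.array([1.0, -1.5])   # one weight per attribute (made up)
b = 0.5                     # intercept

def f(x):
    return np.dot(w, x) + b   # the linear combination; f(x) = 0 is the decision boundary

x = np.array([2.0, 1.0])
print(f(x), "+" if f(x) > 0 else "-")
```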
20%
Linear functions are one of the workhorses of data science;
20%
going to “fit” this parameterized model to a particular dataset — meaning specifically, to find a good set of weights on the features.
Paweł Cisło
fit - find good set of weights
20%
Our general procedure will be to define an objective function that represents our goal, and can be calculated for a particular set of weights and a particular set of data. We will then find the optimal value for the weights by maximizing or minimizing the objective function.
Paweł Cisło
Procedure of choosing weights
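A minimal sketch (not the book's procedure) of that idea: write the objective as a function of the weights and hand it to a general-purpose optimizer. The toy data and the sum-of-squared-errors objective here are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)   # toy data

def objective(w):
    # Sum of squared errors of the linear model X @ w on this dataset
    return np.sum((X @ w - y) ** 2)

result = minimize(objective, x0=np.zeros(2))   # find weights minimizing the objective
print(result.x)                                # close to [3, -2] on this toy data
```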
21%
support vector machines are linear discriminants.
Paweł Cisło
SVMs
21%
If the data are not linearly separable, the best fit is some balance between a fat margin and a low total error penalty.
Paweł Cisło
Not linearly separable data
21%
“loss” is used across data science as a general term for error penalty. A loss function determines how much penalty should be assigned to an instance based on the error in the model’s predicted value — in our present context, based on its distance from the separation boundary.
Paweł Cisło
Loss function
21%
Support vector machines use hinge loss, so called because the loss graph looks like a hinge. Hinge loss incurs no penalty for an example that is not on the wrong side of the margin.
Paweł Cisło
Hinge loss
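A minimal sketch of the usual hinge-loss formula, max(0, 1 - y·f(x)) for labels y in {-1, +1}; the exact form is an assumption here, not quoted from the book.

```python
def hinge_loss(y, fx):
    return max(0.0, 1.0 - y * fx)

print(hinge_loss(+1, 2.5))   # 0.0 -- beyond the margin on the correct side: no penalty
print(hinge_loss(+1, 0.3))   # 0.7 -- inside the margin: small penalty
print(hinge_loss(+1, -1.0))  # 2.0 -- wrong side: penalty grows with distance
```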
21%
Squared error specifies a loss proportional to the square of the distance from the boundary. Squared error loss usually is used for numeric value prediction (regression), rather than classification.
Paweł Cisło
Squared error
21%
Unfortunately, using squared error for classification also penalizes points far on the correct side of the decision boundary.
Paweł Cisło
Unfortunate use of squared error
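A minimal sketch using the common form (y - f(x))^2; note the penalty assigned to a point that is far on the correct side of the boundary, where hinge loss would be zero.

```python
def squared_loss(y, fx):
    return (y - fx) ** 2

print(squared_loss(+1, 5.0))   # 16.0 -- correctly classified, far from the boundary, yet heavily penalized
print(squared_loss(+1, 1.0))   # 0.0  -- only a prediction right at the target escapes penalty
```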
21%
Each different linear regression modeling procedure uses one particular choice (and the data scientist should think carefully about whether it is appropriate for the problem).
Paweł Cisło
Choosing different linear regression modeling procedures
21%
an intuitive notion of the fit of the model is: how far away are the estimated values from the true values on the training data?
Paweł Cisło
Intuitive notion of the fit of the model
21%
For a particular training dataset, we could compute this error for each individual data point and sum up the results. Then the model that fits the data best would be the model with the minimum sum of errors on the training data. And that is exactly what regression procedures do.
Paweł Cisło
Regression procedures
21%
The method that is most natural is to simply subtract one from the other (and take the absolute value). So if I predict 10 and the actual value is 12 or 8, I make an error of 2. This is called absolute error, and we could then minimize the sum of absolute errors or equivalently the mean of the absolute errors across the training data.
Paweł Cisło
Absolute error
21%
Standard linear regression procedures instead minimize the sum or mean of the squares of these errors — which gives the procedure its common name “least squares” regression.
Paweł Cisło
Standard linear regression procedures
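A minimal sketch (not from the book) contrasting the two objectives on toy data with one large outlier; the least-absolute-error fit is the more robust choice discussed a few highlights later.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=30)
y[0] += 25.0   # one large outlier

def sum_squared(params):
    a, b = params
    return np.sum((y - (a * x + b)) ** 2)

def sum_absolute(params):
    a, b = params
    return np.sum(np.abs(y - (a * x + b)))

print(minimize(sum_squared, x0=[1.0, 0.0], method="Nelder-Mead").x)    # pulled toward the outlier
print(minimize(sum_absolute, x0=[1.0, 0.0], method="Nelder-Mead").x)   # typically closer to slope 2, intercept 1
```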
22%
squared error is particularly convenient mathematically.
Paweł Cisło
Squared error = convenience
22%
analysts often claim to prefer squared error because it strongly penalizes very large errors.
Paweł Cisło
Squared error = strong penalisation of very large errors
22%
In systems that build and apply models totally automatically, the modeling needs to be much more robust than when doing a detailed regression analysis “by hand.”
Paweł Cisło
Automatic systems need more robust modeling
22%
A more robust modeling procedure (e.g., use absolute error instead of squared error as the objective function).
Paweł Cisło
More robust modeling procedure
22%
linear discriminant could be used to identify accounts or transactions as likely to have been defrauded.
Paweł Cisło
Linear discriminant
22%
linear function f(x) that we’ve examined throughout the chapter is used as a measure of the log-odds of the “event” of interest.
Paweł Cisło
Logistic regression
22%
For probability estimation, logistic regression uses the same linear model as do our linear discriminants for classification and linear regression for estimating numeric target values.
22%
The output of the logistic regression model is interpreted as the log-odds of class membership. These log-odds can be translated directly into the probability of class membership.
Paweł Cisło
Output of logistic regression
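A minimal sketch with made-up weights: the linear output is read as log-odds and translated into a probability via p = 1 / (1 + e^(-f(x))).

```python
import numpy as np

w, b = np.array([0.8, -0.4]), 0.1   # made-up model weights

def log_odds(x):
    return np.dot(w, x) + b                      # the same linear function f(x)

def probability(x):
    return 1.0 / (1.0 + np.exp(-log_odds(x)))    # log-odds translated to probability

x = np.array([2.0, 1.0])
print(log_odds(x), probability(x))   # about 1.3 and 0.79
```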
22%
It is estimating the log-odds or, more loosely, the probability of class membership (a numeric quantity) over a categorical class.
Paweł Cisło
Logistic regression
23%
This is a direct consequence of the fact that classification trees select a single attribute at a time whereas linear classifiers use a weighted combination of all attributes.
Paweł Cisło
Classification trees vs linear classifiers
24%
The two most common families of techniques that are based on fitting the parameters of complex, nonlinear functions are nonlinear support-vector machines and neural networks.
Paweł Cisło
Most popular techniques
24%
Support vector machines have a so-called “kernel function” that maps the original features to some other feature space.
Paweł Cisło
Kernel function in the SVMs
24%
“polynomial kernel,” which essentially means it would consider “higher-order” combinations of the original features (e.g., squared features, products of features).
Paweł Cisło
Polynomial kernel
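A minimal sketch making those higher-order combinations explicit for two features; in the expanded space, a plain linear discriminant can express a boundary (here a circle) that is nonlinear in the original features.

```python
import numpy as np

def degree2_map(x):
    x1, x2 = x
    # original features, squared features, and the product of the features
    return np.array([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

# A linear discriminant in the expanded space: w . phi(x) + b = x1^2 + x2^2 - 1,
# i.e., a circular boundary in the original feature space.
w, b = np.array([0.0, 0.0, 1.0, 1.0, 0.0]), -1.0
for x in [(0.2, 0.3), (2.0, 0.0)]:
    print(x, "+" if np.dot(w, degree2_map(x)) + b > 0 else "-")
```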
25%
The parameters now are the coefficients of all the models, taken together.
Paweł Cisło
Parameters = coefficients
25%
as we increase the amount of flexibility we have to fit the data, we increase the chance that we fit the data too well. The model can fit details of its particular training set rather than finding patterns or models that apply more generally.
Paweł Cisło
Increasing flexibility of fitting the data
25%
Finding chance occurrences in data that look like interesting patterns, but which do not generalize, is called overfitting the data.
Paweł Cisło
Overfitting
25%
Generalization is the property of a model or modeling process, whereby the model applies to data that were not used to build the model.
Paweł Cisło
Generalization
25%
Overfitting is the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.
Paweł Cisło
Overfitting
25%
pure memorization, the most extreme overfitting procedure possible.
Paweł Cisło
Pure memorization
26%
there is a fundamental trade-off between model complexity and the possibility of overfitting.
Paweł Cisło
Model complexity vs possibility of overfitting
26%
fitting graph shows the accuracy of a model as a function of complexity.
Paweł Cisło
Fitting graph
26%
concept that is fundamental to evaluation in data science: holdout data.
Paweł Cisło
Holdout data
26%
What we need to do is to “hold out” some data for which we know the value of the target variable, but which will not be used to build the model.
Paweł Cisło
Which data to "hold out"?
26%
As the models get too complex, they look very accurate on the training data, but in fact are overfitting — the training accuracy diverges from the holdout (generalization) accuracy.
Paweł Cisło
Models getting too complex
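A minimal sketch (not from the book) pulling the last few ideas together with scikit-learn: hold out data, sweep a complexity knob (tree depth), and watch training accuracy keep rising while holdout accuracy eventually stalls or drops, which is the fitting graph in miniature.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 3, 5, 10, 20):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth,
          round(tree.score(X_train, y_train), 3),   # training accuracy
          round(tree.score(X_hold, y_hold), 3))     # holdout (generalization) accuracy
```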