Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
26%
Generally, there will be more overfitting as one allows the model to be more complex.
Paweł Cisło
More overfitting
26%
complexity of the model; in this case, the number of rows allowed in the table.
Paweł Cisło
Complexity = number of rows allowed in the table
27%
mathematical functions can become more complex is by adding more variables (more attributes).
Paweł Cisło
Mathematical functions becoming complex
27%
SVM tends to be less sensitive to individual examples. The SVM training procedure incorporates complexity control,
Paweł Cisło
SVMs
28%
variance is critical for assessing confidence in the performance estimate,
Paweł Cisło
Variance
28%
cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing.
Paweł Cisło
Cross-validation computing its estimates
28%
Cross-validation begins by splitting a labeled dataset into k partitions called folds. Typically, k will be five or ten.
Paweł Cisło
Cross-validation typically starts with k=5 or 10 folds
28%
the original dataset is split randomly into five equal-sized pieces. Then, each piece is used in turn as the test set, with the other four used to train a model. The result is five different accuracy results, which then can be used to compute the average accuracy and its variance.
Paweł Cisło
Five-fold cross-validation
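A minimal sketch of that five-fold procedure, assuming scikit-learn and synthetic placeholder data (the classifier choice is illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder dataset

    accuracies = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        # each of the five pieces serves once as the test set
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        accuracies.append(model.score(X[test_idx], y[test_idx]))

    print("mean accuracy:", np.mean(accuracies), "variance:", np.var(accuracies))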
28%
Each iteration produces one model, and thereby one estimate of generalization performance, for example, one estimate of accuracy. When cross-validation is finished, every example will have been used only once for testing but k–1 times for training.
Paweł Cisło
Iteration
29%
trees may be preferable to logistic regression because of their greater stability and performance.
Paweł Cisło
Trees can offer greater stability and performance than logistic regression
29%
If the training set size changes, you may also expect different generalization performance from the resultant model. All else being equal, the generalization performance of data-driven modeling generally improves as more training data become available, up to a point.
Paweł Cisło
Generalisation performance of data-driven modeling
29%
plot of the generalization performance against the amount of training data is called a learning curve
Paweł Cisło
Learning curve
29%
learning curve shows the generalization performance — the performance only on testing data, plotted against the amount of training data used.
Paweł Cisło
What learning curve shows
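One way such a curve might be traced, assuming scikit-learn's learning_curve helper (dataset and model are placeholders):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import learning_curve
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, random_state=0)  # placeholder dataset
    sizes, _, test_scores = learning_curve(
        DecisionTreeClassifier(random_state=0), X, y,
        train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

    # generalization performance (cross-validated test accuracy) vs. training size
    for n, score in zip(sizes, test_scores.mean(axis=1)):
        print(n, "training examples ->", round(score, 3))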
29%
fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model complexity
Paweł Cisło
What fitting graph shows
29%
for smaller training-set sizes, logistic regression yields better generalization accuracy than tree induction.
29%
with more flexibility comes more overfitting.
29%
for smaller data, tree induction will tend to overfit more.
29%
logistic regression to perform better for smaller datasets (not always, though).
29%
flexibility of tree induction can be an advantage with larger training sets.
29%
tree can represent substantially nonlinear relationships between the features and the target.
29%
Tree induction commonly uses two techniques to avoid overfitting. These strategies are (i) to stop growing the tree before it gets too complex, and (ii) to grow the tree until it is too large, then “prune” it back, reducing its size (and thereby its complexity).
Paweł Cisło
Tree induction commonly uses two techniques to avoid overfitting
29%
The simplest method to limit tree size is to specify a minimum number of instances that must be present in a leaf.
Paweł Cisło
Method to limit tree size
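In scikit-learn terms this is a single parameter; a sketch with an arbitrary threshold of 20 instances per leaf:

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder dataset
    # every leaf must contain at least 20 training instances, capping tree size
    tree = DecisionTreeClassifier(min_samples_leaf=20).fit(X, y)
    print("leaves:", tree.get_n_leaves())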
29%
hypothesis test tries to assess whether a difference in some statistic is not due simply to chance.
Paweł Cisło
Hypothesis test
29%
alternative to setting a fixed size for the leaves is to conduct a hypothesis test at every leaf to determine whether the observed difference in (say) information gain could have been due to chance.
Paweł Cisło
Hypothesis test at every leaf
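The book does not tie this to one specific test; a common choice is a chi-squared test on the class counts a candidate split would produce. A sketch with made-up counts:

    from scipy.stats import chi2_contingency

    # rows: the two children of a candidate split; columns: class counts (made up)
    observed = [[30, 10],
                [12, 28]]
    chi2, p_value, dof, expected = chi2_contingency(observed)
    # a small p-value suggests the difference is not just chance, so keep splitting
    print("p =", round(p_value, 4))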
29%
general idea is to estimate whether replacing a set of leaves or a branch with a leaf would reduce accuracy. If not, then go ahead and prune. The process can be iterated on progressive subtrees until any removal or replacement would reduce accuracy.
Paweł Cisło
Pruning trees without reducing accuracy
30%
take the training set and split it again into a training subset and a testing subset. Then we can build models on this training subset and pick the best model based on this testing subset. Let’s call the former the sub-training set and the latter the validation set for clarity. The validation set is separate from the final test set, on which we are never going to make any modeling decisions. This procedure is often called nested holdout testing.
Paweł Cisło
Nested holdout testing
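A minimal sketch of nested holdout, assuming scikit-learn and treating tree size as the decision being made (data and candidate complexities are placeholders):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)  # placeholder dataset
    # final test set: never used for any modeling decision
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # split the training set again: sub-training set + validation set
    X_sub, X_val, y_sub, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

    best_model, best_acc = None, 0.0
    for leaf in (1, 5, 20, 50):  # candidate complexities (illustrative)
        m = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X_sub, y_sub)
        acc = m.score(X_val, y_val)
        if acc > best_acc:
            best_model, best_acc = m, acc

    print("final held-out estimate:", best_model.score(X_test, y_test))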
30%
Nested cross-validation is more complicated, but it works as you might suspect. Say we would like to do cross-validation to assess the generalization accuracy of a new modeling technique, which has an adjustable complexity parameter C, but we do not know how to set it. So, we run cross-validation as described above.
Paweł Cisło
Nested cross-validation
30%
The only difference from regular cross-validation is that for each fold we first run this experiment to find C, using another, smaller, cross-validation.
Paweł Cisło
Nested cross-validation vs regular one
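In scikit-learn this nesting can be expressed by placing a grid search inside cross-validation; a sketch with an SVM whose C plays the role of the complexity parameter:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder dataset
    # inner, smaller cross-validation chooses C separately for each outer fold
    inner = GridSearchCV(SVC(), param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
    outer_scores = cross_val_score(inner, X, y, cv=5)  # outer cross-validation
    print("accuracy:", outer_scores.mean(), "+/-", outer_scores.std())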
30%
This idea of using the data to choose the complexity experimentally, as well as to build the resulting model, applies across different induction algorithms and different sorts of complexity.
Paweł Cisło
Idea of using data to choose the complexity experimentally
30%
sequential forward selection (SFS) of features uses a nested holdout procedure to first pick the best individual feature, by looking at all models built using just one feature. After choosing a first feature, SFS tests all models that add a second feature to this first chosen feature. The best pair is then selected. Next the same procedure is done for three, then four, and so on. When adding a feature does not improve classification accuracy on the validation data, the SFS process stops. (There is a similar procedure called sequential backward elimination of features. As you might guess, it works in the opposite direction: starting with all the features and removing them one at a time for as long as performance does not suffer.)
Paweł Cisło
Sequential forward selection (SFS)
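A toy version of the forward loop, assuming a single nested holdout split (real implementations, such as scikit-learn's SequentialFeatureSelector, are more careful):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_sub, X_val, y_sub, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

    selected, remaining, best_acc = [], list(range(X.shape[1])), 0.0
    while remaining:
        # score every model that adds one more feature to those already chosen
        scores = {}
        for f in remaining:
            cols = selected + [f]
            m = LogisticRegression(max_iter=1000).fit(X_sub[:, cols], y_sub)
            scores[f] = m.score(X_val[:, cols], y_val)
        best_f = max(scores, key=scores.get)
        if scores[best_f] <= best_acc:  # no improvement on validation data: stop
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_acc = scores[best_f]

    print("selected features:", selected)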
30%
Models will be better if they fit the data better, but they also will be better if they are simpler. This general methodology is called regularization
Paweł Cisło
Regularisation
30%
arg max_w just means that you want to maximize the fit over all possible arguments w, and are interested in the particular argument w that gives the maximum. These would be the parameters of the final model.)
Paweł Cisło
Arg max_w
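Putting these two highlights together, the regularized objective has this general shape (λ is the knob trading fit against simplicity):

    w* = arg max_w [ fit(x, w) − λ · penalty(w) ]

With an L2 penalty, penalty(w) is the sum of the squared weights, Σ_j w_j².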
30%
linear support vector machine learning is almost equivalent to the L2-regularized logistic regression just discussed; the only difference is that a support vector machine uses hinge loss instead of likelihood in its optimization.
Paweł Cisło
Linear SVM vs L2-regularised logistic regression
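One way to see the correspondence in code: scikit-learn's SGDClassifier keeps the same L2 penalty while letting the loss be swapped (parameter values are illustrative; "log_loss" is the loss name used by recent scikit-learn versions):

    from sklearn.linear_model import SGDClassifier

    # linear SVM: hinge loss + L2 penalty
    svm_like = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3)
    # L2-regularized logistic regression: only the loss changes
    logreg_like = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-3)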
31%
general approach to optimizing the parameter values of a data mining procedure is known as grid search.
Paweł Cisło
Grid search
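A sketch of a grid search over two SVM parameters, assuming scikit-learn (the grid values are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder dataset
    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}  # a 3 x 3 grid
    search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)   # every combination is tried
    print(search.best_params_, round(search.best_score_, 3))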
31%
“the problem of multiple comparisons,” a very important statistical phenomenon that business analysts and data scientists should always keep in mind.
Paweł Cisło
The problem of multiple comparisons
31%
care can be taken to reduce overfitting as much as possible, by using the holdout procedures described in this chapter and if possible by looking carefully at the results before declaring victory.
Paweł Cisło
Reducing overfitting
34%
nearest-neighbor methods often use weighted voting or similarity moderated voting such that each neighbor’s contribution is scaled by its similarity.
Paweł Cisło
Weighted voting or similarity moderated voting
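In scikit-learn this is a one-argument change; a sketch:

    from sklearn.neighbors import KNeighborsClassifier

    # "distance" scales each neighbor's vote by the inverse of its distance,
    # so more-similar neighbors contribute more than distant ones
    knn = KNeighborsClassifier(n_neighbors=5, weights="distance")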
34%
Because no model is built during “training” and most effort is deferred until instances are retrieved, this general idea is known as lazy learning (Aha, 1997).
Paweł Cisło
Lazy learning
34%
If you’re thinking that 1-NN must overfit very strongly, then you are correct.
Paweł Cisło
1-NN
34%
The 1-NN classifier predicts perfectly for training examples, but it also can make an often reasonable prediction on other examples: it uses the most similar training example.
Paweł Cisło
1-NN classifier
35%
k in a k-NN classifier is a complexity parameter.
Paweł Cisło
k = complexity parameter in k-NN
35%
n-NN model (ignoring similarity weighting) simply predicts the average value in the dataset for each case.
Paweł Cisło
n-NN model
35%
we can conduct cross-validation or other nested holdout testing on the training set, for a variety of different values of k, searching for one that gives the best performance on the training data.
35%
Data mining tools usually have the ability to do such nested cross-validation to set k automatically.
Paweł Cisło
Data mining tools
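A sketch of that automatic search for k, assuming scikit-learn and placeholder training data (the range of k values is arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    X_train, y_train = make_classification(n_samples=500, random_state=0)
    # cross-validation on the training data only, over a range of k
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": list(range(1, 31))}, cv=10).fit(X_train, y_train)
    print("chosen k:", search.best_params_["n_neighbors"])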
35%
nearest-neighbor “model” consists of the entire case set (the database), the distance function, and the combining function.
Paweł Cisło
Nearest-neighbor model
35%
if model intelligibility and justification are critical, nearest-neighbor methods should be avoided.
Paweł Cisło
When to avoid nearest-neighbor
35%
One benefit of nearest-neighbor methods is that training is very fast because it usually involves only storing the instances.
Paweł Cisło
Training of nearest-neighbor is fast
35%
Some applications require extremely fast predictions; for example, in online advertisement targeting, decisions may need to be made in a few tens of milliseconds. For such applications, a nearest neighbor method may be impractical.
Paweł Cisło
Online advertisement targeting needs fast decisions
36%
Euclidean distance is probably the most widely used distance metric in data science. It is general, intuitive, and computationally very fast. Because it employs the squares of the distances along each individual dimension, it is sometimes called the L2 norm and sometimes represented by || · ||₂.
Paweł Cisło
Euclidean distance
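A worked example with NumPy (the two vectors are made up):

    import numpy as np

    a = np.array([3.0, 4.0])
    b = np.array([0.0, 0.0])
    # square the per-dimension differences, sum them, take the square root
    d = np.sqrt(np.sum((a - b) ** 2))
    print(d)                      # 5.0
    print(np.linalg.norm(a - b))  # same value: the L2 norm of the difference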
36%
in a nearest-neighbor method the distance function is critical. It basically reduces a comparison of two (potentially complex) examples into a single number.
Paweł Cisło
Importance of distance function in nearest-neighbor