Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
26%
Generally, there will be more overfitting as one allows the model to be more complex.
Paweł Cisło
More overfitting
26%
complexity of the model; in this case, the number of rows allowed in the table.
Paweł Cisło
Complexity = number of rows allowed in the table
27%
mathematical functions can become more complex is by adding more variables (more attributes).
Paweł Cisło
Mathematical functions becoming complex
27%
SVM tends to be less sensitive to individual examples. The SVM training procedure incorporates complexity control,
Paweł Cisło
SVMs
28%
variance is critical for assessing confidence in the performance estimate,
Paweł Cisło
Variance
28%
cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing.
Paweł Cisło
Cross-validation computing its estimates
28%
Cross-validation begins by splitting a labeled dataset into k partitions called folds. Typically, k will be five or ten.
Paweł Cisło
Cross-validation typically starts with k=5 or 10 folds
28%
the original dataset is split randomly into five equal-sized pieces. Then, each piece is used in turn as the test set, with the other four used to train a model. The result is five different accuracy results, which then can be used to compute the average accuracy and its variance.
Paweł Cisło
Five-fold cross-validation
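A minimal sketch of that five-fold procedure, assuming scikit-learn and synthetic placeholder data (the classifier choice is illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder dataset

    accuracies = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        # each of the five pieces serves once as the test set
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        accuracies.append(model.score(X[test_idx], y[test_idx]))

    print("mean accuracy:", np.mean(accuracies), "variance:", np.var(accuracies))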
28%
Each iteration produces one model, and thereby one estimate of generalization performance, for example, one estimate of accuracy. When cross-validation is finished, every example will have been used only once for testing but k–1 times for training.
Paweł Cisło
Iteration
29%
trees may be preferable to logistic regression because of their greater stability and performance.
Paweł Cisło
Trees can offer greater stability and performance than logistic regression
29%
If the training set size changes, you may also expect different generalization performance from the resultant model. All else being equal, the generalization performance of data-driven modeling generally improves as more training data become available, up to a point.
Paweł Cisło
Generalisation performance of data-driven modeling
29%
plot of the generalization performance against the amount of training data is called a learning curve
Paweł Cisło
Learning curve
29%
learning curve shows the generalization performance — the performance only on testing data, plotted against the amount of training data used.
Paweł Cisło
What learning curve shows
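One way such a curve might be traced, assuming scikit-learn's learning_curve helper (dataset and model are placeholders):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import learning_curve
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, random_state=0)  # placeholder dataset
    sizes, _, test_scores = learning_curve(
        DecisionTreeClassifier(random_state=0), X, y,
        train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

    # generalization performance (cross-validated test accuracy) vs. training size
    for n, score in zip(sizes, test_scores.mean(axis=1)):
        print(n, "training examples ->", round(score, 3))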
29%
fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model complexity
Paweł Cisło
What fitting graph shows
29%
for smaller training-set sizes, logistic regression yields better generalization accuracy than tree induction.
29%
with more flexibility comes more overfitting.
29%
for smaller data, tree induction will tend to overfit more.
29%
logistic regression to perform better for smaller datasets (not always, though).
29%
flexibility of tree induction can be an advantage with larger training sets.
29%
tree can represent substantially nonlinear relationships between the features and the target.
29%
Tree induction commonly uses two techniques to avoid overfitting. These strategies are (i) to stop growing the tree before it gets too complex, and (ii) to grow the tree until it is too large, then “prune” it back, reducing its size (and thereby its complexity).
Paweł Cisło
Tree induction commonly uses two techniques to avoid overfitting
29%
The simplest method to limit tree size is to specify a minimum number of instances that must be present in a leaf.
Paweł Cisło
Method to limit tree size
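In scikit-learn terms this is a single parameter; a sketch with an arbitrary threshold of 20 instances per leaf:

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder dataset
    # every leaf must contain at least 20 training instances, capping tree size
    tree = DecisionTreeClassifier(min_samples_leaf=20).fit(X, y)
    print("leaves:", tree.get_n_leaves())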
29%
hypothesis test tries to assess whether a difference in some statistic is not due simply to chance.
Paweł Cisło
Hypothesis test
29%
alternative to setting a fixed size for the leaves is to conduct a hypothesis test at every leaf to determine whether the observed difference in (say) information gain could have been due to chance.
Paweł Cisło
Hypothesis test at every leaf
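The book does not tie this to one specific test; a common choice is a chi-squared test on the class counts a candidate split would produce. A sketch with made-up counts:

    from scipy.stats import chi2_contingency

    # rows: the two children of a candidate split; columns: class counts (made up)
    observed = [[30, 10],
                [12, 28]]
    chi2, p_value, dof, expected = chi2_contingency(observed)
    # a small p-value suggests the difference is not just chance, so keep splitting
    print("p =", round(p_value, 4))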
29%
general idea is to estimate whether replacing a set of leaves or a branch with a leaf would reduce accuracy. If not, then go ahead and prune. The process can be iterated on progressive subtrees until any removal or replacement would reduce accuracy.
Paweł Cisło
Pruning trees without reducing accuracy
30%
take the training set and split it again into a training subset and a testing subset. Then we can build models on this training subset and pick the best model based on this testing subset. Let’s call the former the sub-training set and the latter the validation set for clarity. The validation set is separate from the final test set, on which we are never going to make any modeling decisions. This procedure is often called nested holdout testing.
Paweł Cisło
Nested holdout testing
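A minimal sketch of nested holdout, assuming scikit-learn and treating tree size as the decision being made (data and candidate complexities are placeholders):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)  # placeholder dataset
    # final test set: never used for any modeling decision
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # split the training set again: sub-training set + validation set
    X_sub, X_val, y_sub, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

    best_model, best_acc = None, 0.0
    for leaf in (1, 5, 20, 50):  # candidate complexities (illustrative)
        m = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X_sub, y_sub)
        acc = m.score(X_val, y_val)
        if acc > best_acc:
            best_model, best_acc = m, acc

    print("final held-out estimate:", best_model.score(X_test, y_test))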
30%
Nested cross-validation is more complicated, but it works as you might suspect. Say we would like to do cross-validation to assess the generalization accuracy of a new modeling technique, which has an adjustable complexity parameter C, but we do not know how to set it. So, we run cross-validation as described above.
Paweł Cisło
Nested cross-validation
30%
The only difference from regular cross-validation is that for each fold we first run this experiment to find C, using another, smaller, cross-validation.
Paweł Cisło
Nested cross-validation vs regular one
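In scikit-learn this nesting can be expressed by placing a grid search inside cross-validation; a sketch with an SVM whose C plays the role of the complexity parameter:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder dataset
    # inner, smaller cross-validation chooses C separately for each outer fold
    inner = GridSearchCV(SVC(), param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
    outer_scores = cross_val_score(inner, X, y, cv=5)  # outer cross-validation
    print("accuracy:", outer_scores.mean(), "+/-", outer_scores.std())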
30%
This idea of using the data to choose the complexity experimentally, as well as to build the resulting model, applies across different induction algorithms and different sorts of complexity.
Paweł Cisło
Idea of using data to choose the complexity experimentally
30%
sequential forward selection (SFS) of features uses a nested holdout procedure to first pick the best individual feature, by looking at all models built using just one feature. After choosing a first feature, SFS tests all models that add a second feature to this first chosen feature. The best pair is then selected. Next the same procedure is done for three, then four, and so on. When adding a feature does not improve classification accuracy on the validation data, the SFS process stops. (There is a similar procedure called sequential backward elimination of features. As you might guess, it works in the opposite direction: starting with all the features and removing them one at a time for as long as performance does not suffer.)
Paweł Cisło
Sequential forward selection (SFS)
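A toy version of the forward loop, assuming a single nested holdout split (real implementations, such as scikit-learn's SequentialFeatureSelector, are more careful):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_sub, X_val, y_sub, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

    selected, remaining, best_acc = [], list(range(X.shape[1])), 0.0
    while remaining:
        # score every model that adds one more feature to those already chosen
        scores = {}
        for f in remaining:
            cols = selected + [f]
            m = LogisticRegression(max_iter=1000).fit(X_sub[:, cols], y_sub)
            scores[f] = m.score(X_val[:, cols], y_val)
        best_f = max(scores, key=scores.get)
        if scores[best_f] <= best_acc:  # no improvement on validation data: stop
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_acc = scores[best_f]

    print("selected features:", selected)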
30%
Models will be better if they fit the data better, but they also will be better if they are simpler. This general methodology is called regularization
Paweł Cisło
Regularisation
30%
arg max_w just means that you want to maximize the fit over all possible arguments w, and are interested in the particular argument w that gives the maximum. These would be the parameters of the final model.)
Paweł Cisło
Arg max_w
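Putting these two highlights together, the regularized objective has this general shape (λ is the knob trading fit against simplicity):

    w* = arg max_w [ fit(x, w) − λ · penalty(w) ]

With an L2 penalty, penalty(w) is the sum of the squared weights, Σ_j w_j².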
30%
linear support vector machine learning is almost equivalent to the L2-regularized logistic regression just discussed; the only difference is that a support vector machine uses hinge loss instead of likelihood in its optimization.
Paweł Cisło
Linear SVM vs L2-regularised logistic regression
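One way to see the correspondence in code: scikit-learn's SGDClassifier keeps the same L2 penalty while letting the loss be swapped (parameter values are illustrative; "log_loss" is the loss name used by recent scikit-learn versions):

    from sklearn.linear_model import SGDClassifier

    # linear SVM: hinge loss + L2 penalty
    svm_like = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3)
    # L2-regularized logistic regression: only the loss changes
    logreg_like = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-3)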
31%
general approach to optimizing the parameter values of a data mining procedure is known as grid search.
Paweł Cisło
Grid search
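A sketch of a grid search over two SVM parameters, assuming scikit-learn (the grid values are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder dataset
    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}  # a 3 x 3 grid
    search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)   # every combination is tried
    print(search.best_params_, round(search.best_score_, 3))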
31%
“the problem of multiple comparisons,” a very important statistical phenomenon that business analysts and data scientists should always keep in mind.
Paweł Cisło
The problem of multiple comparisons
31%
care can be taken to reduce overfitting as much as possible, by using the holdout procedures described in this chapter and if possible by looking carefully at the results before declaring victory.
Paweł Cisło
Reducing overfitting
34%
nearest-neighbor methods often use weighted voting or similarity moderated voting such that each neighbor’s contribution is scaled by its similarity.
Paweł Cisło
Weighted voting or similarity moderated voting
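In scikit-learn this is a one-argument change; a sketch:

    from sklearn.neighbors import KNeighborsClassifier

    # "distance" scales each neighbor's vote by the inverse of its distance,
    # so more-similar neighbors contribute more than distant ones
    knn = KNeighborsClassifier(n_neighbors=5, weights="distance")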
34%
Because no model is built during “training” and most effort is deferred until instances are retrieved, this general idea is known as lazy learning (Aha, 1997).
Paweł Cisło
Lazy learning
34%
If you’re thinking that 1-NN must overfit very strongly, then you are correct.
Paweł Cisło
1-NN
34%
The 1-NN classifier predicts perfectly for training examples, but it also can make an often reasonable prediction on other examples: it uses the most similar training example.
Paweł Cisło
1-NN classifier
35%
k in a k-NN classifier is a complexity parameter.
Paweł Cisło
k = complexity parameter in k-NN
35%
n-NN model (ignoring similarity weighting) simply predicts the average value in the dataset for each case.
Paweł Cisło
n-NN model
35%
we can conduct cross-validation or other nested holdout testing on the training set, for a variety of different values of k, searching for one that gives the best performance on the training data.
35%
Data mining tools usually have the ability to do such nested cross-validation to set k automatically.
Paweł Cisło
Data mining tools
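A sketch of that automatic search for k, assuming scikit-learn and placeholder training data (the range of k values is arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    X_train, y_train = make_classification(n_samples=500, random_state=0)
    # cross-validation on the training data only, over a range of k
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": list(range(1, 31))}, cv=10).fit(X_train, y_train)
    print("chosen k:", search.best_params_["n_neighbors"])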
35%
nearest-neighbor “model” consists of the entire case set (the database), the distance function, and the combining function.
Paweł Cisło
Nearest-neighbor model
35%
if model intelligibility and justification are critical, nearest-neighbor methods should be avoided.
Paweł Cisło
When to avoid nearest-neighbor
35%
One benefit of nearest-neighbor methods is that training is very fast because it usually involves only storing the instances.
Paweł Cisło
Training of nearest-neighbor is fast
35%
Some applications require extremely fast predictions; for example, in online advertisement targeting, decisions may need to be made in a few tens of milliseconds. For such applications, a nearest neighbor method may be impractical.
Paweł Cisło
Online advertisement targeting needs fast decisions
36%
Euclidean distance is probably the most widely used distance metric in data science. It is general, intuitive, and computationally very fast. Because it employs the squares of the distances along each individual dimension, it is sometimes called the L2 norm and sometimes represented by || · ||₂.
Paweł Cisło
Euclidean distance
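A worked example with NumPy (the two vectors are made up):

    import numpy as np

    a = np.array([3.0, 4.0])
    b = np.array([0.0, 0.0])
    # square the per-dimension differences, sum them, take the square root
    d = np.sqrt(np.sum((a - b) ** 2))
    print(d)                      # 5.0
    print(np.linalg.norm(a - b))  # same value: the L2 norm of the difference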
36%
in a nearest-neighbor method the distance function is critical. It basically reduces a comparison of two (potentially complex) examples into a single number.
Paweł Cisło
Importance of distance function in nearest-neighbor