the original dataset is split randomly into five equal-sized pieces. Then each piece is used in turn as the test set, with the other four used to train a model. The result is five different accuracy estimates, which can then be used to compute the average accuracy and its variance.
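A minimal sketch of that five-fold procedure, assuming scikit-learn; the dataset and classifier below are illustrative stand-ins, not the book's:

```python
# A minimal sketch of five-fold cross-validation with scikit-learn.
# The dataset and classifier are placeholders for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Split the data into 5 folds; each fold serves once as the test set
# while the other four folds train the model.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print("fold accuracies:", scores)
print("mean accuracy: %.3f  variance: %.5f" % (scores.mean(), scores.var()))
```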
If the training set size changes, you may also expect different generalization performance from the resultant model. All else being equal, the generalization performance of data-driven modeling generally improves as more training data become available, up to a point.
for smaller training-set sizes, logistic regression yields better generalization accuracy than tree induction.
with more flexibility comes more overfitting.
for smaller data, tree induction will tend to overfit more.
we would expect logistic regression to perform better for smaller datasets (not always, though).
flexibility of tree induction can be an advantage with larger training sets: the tree can represent substantially nonlinear relationships between the features and the target.
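A hedged sketch of the learning-curve comparison above, assuming scikit-learn's learning_curve helper; the dataset and model settings are illustrative:

```python
# Compare generalization accuracy as a function of training-set size for
# logistic regression versus tree induction. Choices here are illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
sizes = np.linspace(0.1, 1.0, 5)  # fractions of the training set to use

for name, model in [("logistic regression", LogisticRegression(max_iter=5000)),
                    ("tree induction", DecisionTreeClassifier(random_state=0))]:
    # learning_curve trains on increasing subsets and cross-validates each.
    train_sizes, _, test_scores = learning_curve(model, X, y,
                                                 train_sizes=sizes, cv=5)
    for n, s in zip(train_sizes, test_scores.mean(axis=1)):
        print(f"{name}: n={n:4d}  accuracy={s:.3f}")
```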
Tree induction commonly uses two techniques to avoid overfitting. These strategies are (i) to stop growing the tree before it gets too complex, and (ii) to grow the tree until it is too large, then “prune” it back, reducing its size (and thereby its complexity).
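The two strategies map roughly onto scikit-learn's tree parameters: pre-pruning via growth limits such as min_samples_leaf, and post-pruning via cost-complexity pruning. A sketch, with illustrative parameter values rather than recommendations:

```python
# (i) Pre-pruning: stop growing early. (ii) Post-pruning: grow, then prune.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# (i) Stop growing early: require a minimum number of instances per leaf.
pre_pruned = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

# (ii) Grow fully, then prune back via cost-complexity pruning (ccp_alpha);
# larger alpha removes more subtrees, shrinking the tree.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

print("pre-pruned leaves:", pre_pruned.get_n_leaves())
print("post-pruned leaves:", post_pruned.get_n_leaves())
```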
take the training set and split it again into a training subset and a testing subset. Then we can build models on this training subset and pick the best model based on this testing subset. Let’s call the former the sub-training set and the latter the validation set for clarity. The validation set is separate from the final test set, on which we are never going to make any modeling decisions. This procedure is often called nested holdout testing.
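A minimal sketch of nested holdout testing, assuming scikit-learn; the dataset and the grid of tree depths are illustrative:

```python
# Nested holdout: modeling decisions (here, tree depth) are made on a
# validation set carved out of the training set; the final test set is
# touched exactly once. All names here are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Outer split: the test set is held out and never used for decisions.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Inner split: sub-training set and validation set.
X_sub, X_val, y_sub, y_val = train_test_split(X_train, y_train, random_state=0)

best_depth, best_acc = None, 0.0
for depth in (2, 4, 8, 16):
    acc = (DecisionTreeClassifier(max_depth=depth, random_state=0)
           .fit(X_sub, y_sub)
           .score(X_val, y_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Retrain the chosen model on the full training set, then test once.
final = DecisionTreeClassifier(max_depth=best_depth,
                               random_state=0).fit(X_train, y_train)
print("chosen depth:", best_depth, " test accuracy:", final.score(X_test, y_test))
```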
Nested cross-validation is more complicated, but it works as you might suspect. Say we would like to use cross-validation to assess the generalization accuracy of a new modeling technique that has an adjustable complexity parameter C, but we do not know how to set it. So we run cross-validation as described above, except that before building the model on each fold's training data, we first run an inner nested holdout or cross-validation on that training data alone to choose the best value of C.
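A sketch of nested cross-validation, assuming scikit-learn's GridSearchCV as the inner loop; here the complexity parameter C happens to be logistic regression's regularization parameter, chosen purely for illustration:

```python
# The inner loop picks C on each outer training fold; the outer loop
# estimates generalization accuracy. Grid values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inner cross-validation: choose C using only the outer fold's training data.
inner = GridSearchCV(LogisticRegression(max_iter=5000),
                     param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)

# Outer cross-validation: each outer fold's test data is never seen by
# the inner parameter search.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.3f (+/- %.3f)"
      % (outer_scores.mean(), outer_scores.std()))
```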
sequential forward selection (SFS) of features uses a nested holdout procedure to first pick the best individual feature, by looking at all models built using just one feature. After choosing a first feature, SFS tests all models that add a second feature to this first chosen feature. The best pair is then selected. Next the same procedure is done for three, then four, and so on. When adding a feature does not improve classification accuracy on the validation data, the SFS process stops. (There is a similar procedure called sequential backward elimination of features. As you might guess, it works in the other direction: starting with all the features, it discards them one at a time as long as doing so does not hurt accuracy on the validation data.)
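A hedged sketch of SFS, assuming scikit-learn's SequentialFeatureSelector; as a simplification it selects a fixed number of features here rather than stopping when accuracy stops improving:

```python
# Sequential forward selection with a cross-validated inner evaluation.
# The estimator and feature count are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# direction="forward" adds one feature at a time, keeping the addition that
# most improves cross-validated accuracy; direction="backward" gives
# sequential backward elimination instead.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=5000),
                                n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))
```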
we can conduct cross-validation or other nested holdout testing on the training set, for a variety of different values of k, searching for one that gives the best performance on the training data.
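A sketch of that search over k, assuming scikit-learn; the grid of candidate k values is illustrative:

```python
# Choose k for k-nearest neighbors by cross-validating on the training
# set only, then evaluate the chosen model once on the test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validate over candidate values of k using only the training set.
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7, 9, 15, 25]},
                      cv=5)
search.fit(X_train, y_train)
print("best k:", search.best_params_["n_neighbors"])
print("test accuracy:", search.score(X_test, y_test))
```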
Euclidean distance is probably the most widely used distance metric in data science. It is general, intuitive, and computationally very fast. Because it employs the squares of the distances along each individual dimension, it is sometimes called the L2 norm and sometimes represented by ‖·‖₂.
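For concreteness, the Euclidean distance between points x and y is ‖x − y‖₂ = sqrt(Σᵢ (xᵢ − yᵢ)²). A minimal numpy illustration:

```python
# Euclidean (L2) distance between two points, computed explicitly and
# via numpy's norm for comparison.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Square the differences along each dimension, sum, take the square root.
explicit = np.sqrt(np.sum((a - b) ** 2))
via_norm = np.linalg.norm(a - b)          # the L2 norm of the difference

print(explicit, via_norm)  # both print 5.0
```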