Kindle Notes & Highlights
Read between August 30, 2016 and March 24, 2018
The penalty for a misclassified point is proportional to the distance from the margin boundary, so if possible the SVM will make only “small” errors. Technically, this error function is known as hinge loss.
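For reference, with true class y ∈ {−1, +1} and f(x) the signed output of the linear decision function, hinge loss is usually written as

$$\ell_{\text{hinge}}(y, f(\mathbf{x})) = \max\bigl(0,\; 1 - y \cdot f(\mathbf{x})\bigr),$$

so examples on the correct side of the margin (y · f(x) ≥ 1) incur no penalty at all, and the penalty grows linearly with distance beyond the margin boundary.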
An important thing to remember is that once we see linear regression simply as an instance of fitting a (linear) model to data, we see that we have to choose the objective function to optimize — and we should do so with the ultimate business application in mind.
One very useful notion of the likelihood of an event is the odds. The odds of an event are the ratio of the probability of the event occurring to the probability of the event not occurring.
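In symbols, the definition is

$$\text{odds}(E) = \frac{p(E)}{1 - p(E)}.$$

For example, an event with probability 0.75 has odds 0.75 / 0.25 = 3, i.e., odds of 3 to 1.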
For probability estimation, logistic regression uses the same linear model as do our linear discriminants for classification and linear regression for estimating numeric target values.
Equation 4-3. Log-odds linear function
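In standard notation, writing p₊(x) for the estimated probability that example x belongs to the class of interest, the log-odds linear function is

$$\log\!\left(\frac{p_{+}(\mathbf{x})}{1 - p_{+}(\mathbf{x})}\right) = f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots$$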
Equation 4-4. The logistic function
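Solving the log-odds equation above for p₊(x) gives the logistic function:

$$p_{+}(\mathbf{x}) = \frac{1}{1 + e^{-f(\mathbf{x})}}$$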
The model (set of weights) that gives the highest sum is the model that gives the highest “likelihood” to the data — the “maximum likelihood” model. The maximum likelihood model “on average” gives the highest probabilities to the positive examples and the lowest probabilities to the negative examples.
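In the usual formulation, the sum being maximized is the log-likelihood of the training labels under the model’s probability estimates,

$$g(\mathbf{w}) = \sum_{\mathbf{x}\,\in\,\text{positives}} \log p_{+}(\mathbf{x}) \;+\; \sum_{\mathbf{x}\,\in\,\text{negatives}} \log\bigl(1 - p_{+}(\mathbf{x})\bigr),$$

and maximum-likelihood training chooses the weights w that maximize it.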
The logistic regression procedure then tries to estimate the probabilities (the probability distribution over the instance space) with a linear-log-odds model, based on the observed data, i.e., the results of the draws from that distribution.
The two most common families of techniques that are based on fitting the parameters of complex, nonlinear functions are nonlinear support-vector machines and neural networks.
The model can fit details of its particular training set rather than finding patterns or models that apply more generally.
We have also introduced two criteria by which models can be evaluated: the predictive performance of a model and its intelligibility.
Finding chance occurrences in data that look like interesting patterns, but which do not generalize, is called overfitting the data.
Generalization is the property of a model or modeling process, whereby the model applies to data that were not used to build the model.
We may worry that the training data were not representative of the true population, but that is not the problem here. The data were representative, but the data mining did not create a model that generalized beyond the training data.
What we need to do is to “hold out” some data for which we know the value of the target variable, but which will not be used to build the model. These are not the actual use data, for which we ultimately would like to predict the value of the target variable. Instead, creating holdout data is like creating a “lab test” of generalization performance.
Then we estimate the generalization performance by comparing the predicted values with the hidden true values.
Thus, when the holdout data are used in this manner, they often are called the “test set.”
This is known as the base rate, and a classifier that always selects the majority class is called a base rate classifier. A corresponding baseline for a regression model is a simple model that always predicts the mean or median value of the target variable.
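A minimal sketch of both baselines, assuming scikit-learn and synthetic data purely for illustration (the book itself is not tied to any particular tool):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import train_test_split

# Synthetic, class-imbalanced data purely for illustration.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base rate classifier: always predicts the majority class of the training set.
baseline_clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(baseline_clf.score(X_test, y_test))  # accuracy is roughly the base rate (~0.8)

# Baseline regressor: always predicts the mean (or median) of the training target.
baseline_reg = DummyRegressor(strategy="mean")  # strategy="median" also available
```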
Overfitting in Tree Induction
One way mathematical functions can become more complex is by adding more variables (more attributes).
This concept generalizes: as you increase the dimensionality, you can perfectly fit larger and larger sets of arbitrary points. And even if you cannot fit the dataset perfectly, you can fit it better and better with more dimensions — that is, with more attributes.
The fact that the training and deployment populations are different is a likely source of performance degradation.
A plot of the generalization performance against the amount of training data is called a learning curve.
A learning curve shows the generalization performance — the performance only on testing data, plotted against the amount of training data used. A fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model complexity.
Tree induction commonly uses two techniques to avoid overfitting. These strategies are (i) to stop growing the tree before it gets too complex, and (ii) to grow the tree until it is too large, then “prune” it back, reducing its size (and thereby its complexity).
The simplest method to limit tree size is to specify a minimum number of instances that must be present in a leaf.
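A minimal sketch of this stopping criterion, assuming scikit-learn (the parameter name min_samples_leaf is scikit-learn's, not the book's):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Require every leaf to contain at least 20 training instances; candidate splits
# that would create smaller leaves are not made, which caps the tree's complexity.
tree = DecisionTreeClassifier(min_samples_leaf=20).fit(X, y)
print(tree.get_n_leaves())
```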
So, for stopping tree growth, an alternative to setting a fixed size for the leaves is to conduct a hypothesis test at every leaf to determine whether the observed difference in (say) information gain could have been due to chance. If the hypothesis test concludes that it was likely not due to chance, then the split is accepted and the tree growing continues.
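A sketch of the idea behind such a test, using a chi-squared test of independence on the class counts a candidate split would produce (the specific test and threshold are illustrative choices, not prescribed by the text):

```python
from scipy.stats import chi2_contingency

# Hypothetical (positive, negative) class counts in the two children of a candidate split.
left_counts = [30, 10]
right_counts = [12, 28]

# Could the difference in class distribution between the children be due to chance?
chi2, p_value, dof, expected = chi2_contingency([left_counts, right_counts])
accept_split = p_value < 0.05  # grow further only if the split looks non-random
print(round(p_value, 4), accept_split)
```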
The key is to realize that there was nothing special about the first training/test split we made. Let’s say we are saving the test set for a final assessment. We can take the training set and split it again into a training subset and a testing subset. Then we can build models on this training subset and pick the best model based on this testing subset. Let’s call the former the sub-training set and the latter the validation set for clarity. The validation set is separate from the final test set, on which we are never going to make any modeling decisions. This procedure is often called nested holdout testing.
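A minimal sketch of the nested split, assuming scikit-learn and synthetic data; the variable names are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First split: the test set is locked away for the final assessment only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Second, nested split: carve a validation set out of the training data; models are
# built on the sub-training set and compared on the validation set, never on the test set.
X_subtrain, X_valid, y_subtrain, y_valid = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)
```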
Sequential forward selection (SFS) of features uses a nested holdout procedure to first pick the best individual feature, by looking at all models built using just one feature.
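A sketch of the first round of SFS, continuing the nested-holdout sketch above (X_subtrain, y_subtrain, X_valid, y_valid are the hypothetical arrays defined there); logistic regression is just one possible base model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Score every single-feature model on the validation set and keep the best one.
scores = []
for j in range(X_subtrain.shape[1]):
    model = LogisticRegression().fit(X_subtrain[:, [j]], y_subtrain)
    scores.append(model.score(X_valid[:, [j]], y_valid))
best_first_feature = int(np.argmax(scores))

# Later rounds add, one at a time, whichever remaining feature most improves
# validation performance when combined with the features already selected.
print(best_first_feature)
```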
For equation-based models such as logistic regression, which (unlike trees) do not automatically select which attributes to include, complexity can be controlled by choosing a “right” set of attributes.
The most commonly used penalty is the sum of the squares of the weights, sometimes called the squared “L2-norm” of w. The reason is technical, but basically functions can fit data better if they are allowed to have very large positive and negative weights, and the sum of the squares of the weights imposes a large penalty when weights have large absolute values.
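In the usual formulation, the regularized objective trades off fit against this penalty, with λ controlling how heavily large weights are punished:

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \Bigl[\, \text{fit error}(\mathbf{w}; \text{data}) \;+\; \lambda \sum_j w_j^2 \,\Bigr]$$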
Since these coefficients are the multiplicative weights on the features, L1-regularization effectively performs an automatic form of feature selection.
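A minimal sketch showing this effect with scikit-learn's L1-penalized logistic regression on synthetic data (C is scikit-learn's inverse of the regularization strength λ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# 'liblinear' is one of the solvers that supports the L1 penalty.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print((model.coef_ == 0).sum(), "of", model.coef_.size, "coefficients driven to exactly zero")
```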
This cross-validation would essentially conduct automated experiments on subsets of the training data and find a good λ value. Then this λ would be used to learn a regularized model on all the training data. This has become the standard procedure for building numerical models that give a good balance between data fit and model complexity. This general approach to optimizing the parameter values of a data mining procedure is known as grid search.
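A minimal sketch of that procedure, assuming scikit-learn, ridge regression as the numerical model, and a synthetic dataset (alpha is scikit-learn's name for λ):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=30, noise=10.0, random_state=0)

# Try each candidate lambda with 5-fold cross-validation on the training data,
# then refit on all of it with the best value found (GridSearchCV refits by default).
grid = GridSearchCV(Ridge(), param_grid={"alpha": np.logspace(-3, 3, 13)}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```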
A fitting graph has two curves showing the model performance on the training and testing data as a function of model complexity.
A learning curve shows model performance on testing data plotted against the amount of training data used.
A common experimental methodology called cross-validation specifies a systematic way of splitting up a single dataset such that it generates multiple performance measures.
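A minimal sketch, assuming scikit-learn and one of its bundled datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 10-fold cross-validation: each fold serves once as the test set,
# yielding ten separate performance estimates from a single dataset.
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=10)
print(scores.mean(), scores.std())
```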
The general method for reining in model complexity to avoid overfitting is called model regularization. Techniques include tree pruning (cutting a classification tree back when it has become too large), feature selection, and incorporating explicit complexity penalties into the objective function used for modeling.
Given a new example whose target variable we want to predict, we scan through all the training examples and choose several that are the most similar to the new example. Then we predict the new example’s target value based on the nearest neighbors’ (known) target values.
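A minimal sketch of this nearest-neighbor prediction, assuming scikit-learn (k = 3 and Euclidean distance are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# For each new example, find the 3 most similar training examples (by Euclidean
# distance) and predict the majority class among those neighbors.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict(X_test[:5]))
```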