Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
21%
The penalty for a misclassified point is proportional to the distance from the margin boundary, so if possible the SVM will make only “small” errors. Technically, this error function is known as hinge loss
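A minimal numpy sketch of hinge loss (the function and data are illustrative, not from the book): labels are coded +1/−1, scores are the raw model outputs f(x), points beyond the margin on the correct side cost nothing, and violations cost in proportion to their distance.

```python
import numpy as np

def hinge_loss(y_true, scores):
    # Zero inside the correct margin region; grows linearly with the
    # distance past the margin boundary. y_true is in {+1, -1}.
    return np.maximum(0.0, 1.0 - y_true * scores)

# Safely classified, slightly inside the margin, misclassified:
print(hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.5, 0.3])))  # [0.  0.5  1.3]
```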
21%
For most business problems, choosing squared-error loss for classification or class-probability estimation thus would violate our principle of thinking carefully about whether the loss function is aligned with the business goal.
ElvinOuyang
Squared errors are less helpful to business
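A small numeric illustration (mine, not the book's): squared error caps the penalty for a confidently wrong class-probability estimate at 1, whereas the log loss implied by maximum likelihood grows without bound, which is often closer to the business cost of a confident mistake.

```python
import numpy as np

p = 0.99   # model says the positive class is almost certain...
y = 0      # ...but the true class is negative

squared_error = (y - p) ** 2      # ~0.98: bounded above by 1
log_loss = -np.log(1 - p)         # ~4.6: grows without bound as p -> 1
print(squared_error, log_loss)
```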
22%
An important thing to remember is that once we see linear regression simply as an instance of fitting a (linear) model to data, we see that we have to choose the objective function to optimize — and we should do so with the ultimate business application in mind.
ElvinOuyang
choice of error function should rely on the business purpose
22%
Class Probability Estimation and Logistic “Regression”
ElvinOuyang
Logistic Regression
22%
One very useful notion of the likelihood of an event is the odds. The odds of an event is the ratio of the probability of the event occurring to the probability of the event not occurring.
22%
Taking the logarithm of the odds (called the “log-odds”) helps, since for any number in the range 0 to ∞ its log will be between –∞ and ∞.
ElvinOuyang
log-odds
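A quick worked example (my own numbers, not from the book): an event with probability 0.75 has odds 0.75/0.25 = 3, i.e., 3:1, and log-odds of log(3) ≈ 1.1.

```python
import numpy as np

p = np.array([0.1, 0.5, 0.75, 0.9])   # probabilities (illustrative values)
odds = p / (1 - p)                    # lives in (0, inf)
log_odds = np.log(odds)               # lives in (-inf, inf), 0 at p = 0.5
print(odds)       # [0.111... 1.  3.  9. ]
print(log_odds)   # [-2.197... 0.  1.098...  2.197...]
```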
22%
For probability estimation, logistic regression uses the same linear model as do our linear discriminants for classification and linear regression for estimating numeric target values.
23%
Equation 4-3. Log-odds linear function
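With p+(x) denoting the model's estimate of the probability that instance x belongs to the positive class, the log-odds linear model sets the log-odds equal to the usual linear function of the attributes:

\[ \log\left(\frac{p_{+}(\mathbf{x})}{1 - p_{+}(\mathbf{x})}\right) = f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots \]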
23%
Equation 4-4. The logistic function
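Solving the log-odds equation for p+(x) gives the logistic function, which squashes the unbounded linear score f(x) into a probability between 0 and 1:

\[ p_{+}(\mathbf{x}) = \frac{1}{1 + e^{-f(\mathbf{x})}} \]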
23%
The model (set of weights) that gives the highest sum is the model that gives the highest “likelihood” to the data — the “maximum likelihood” model. The maximum likelihood model “on average” gives the highest probabilities to the positive examples and the lowest probabilities to the negative examples.
ElvinOuyang
objective function for logistic regression: the maximum likelihood model
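A sketch of the quantity being maximized, on tiny hypothetical data (variable names mine): the sum over training examples of the log of the probability the model assigns to each example's actual class.

```python
import numpy as np

def log_likelihood(w, X, y):
    # Sum of log-probabilities the model gives to the observed classes.
    # X has a leading column of 1s for the intercept; y is in {0, 1}.
    p = 1.0 / (1.0 + np.exp(-X @ w))              # logistic function
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.7], [1.0, 2.1]])
y = np.array([0, 1, 0, 1])
# The weight vector that maximizes this sum is the maximum likelihood model.
print(log_likelihood(np.array([0.0, 1.0]), X, y))
```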
23%
The logistic regression procedure then tries to estimate the probabilities (the probability distribution over the instance space) with a linear-log-odds model, based on the observed data on the result of the draws from the distribution.
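In practice a library carries out the estimation; a minimal scikit-learn sketch on synthetic data (the dataset and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

print(model.predict_proba(X[:3]))      # estimated class probabilities
print(model.coef_, model.intercept_)   # the fitted linear-log-odds weights
```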
24%
The two most common families of techniques that are based on fitting the parameters of complex, nonlinear functions are nonlinear support-vector machines and neural networks.
24%
Neural networks offer an intriguing twist. One can think of a neural network as a “stack” of models.
ElvinOuyang
Neural network
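One way to see the “stack” (a sketch with made-up weights, not the book's example): each layer is a set of logistic-regression-like models, and the next layer treats their outputs as its input features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                           # one instance, 3 features

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # layer 1: four logistic-like models
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # layer 2: one model stacked on top

hidden = sigmoid(W1 @ x + b1)        # outputs of the first stack of models
output = sigmoid(W2 @ hidden + b2)   # the next model consumes them as features
print(output)
```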
25%
The model can fit details of its particular training set rather than finding patterns or models that apply more generally.
25%
We have also introduced two criteria by which models can be evaluated: the predictive performance of a model and its intelligibility.
25%
Finding chance occurrences in data that look like interesting patterns, but which do not generalize, is called overfitting the data.
25%
Generalization is the property of a model or modeling process, whereby the model applies to data that were not used to build the model.
25%
We may worry that the training data were not representative of the true population, but that is not the problem here. The data were representative, but the data mining did not create a model that generalized beyond the training data.
26%
The best strategy is to recognize overfitting and to manage complexity in a principled way.
ElvinOuyang
How to deal with overfitting
26%
What we need to do is to “hold out” some data for which we know the value of the target variable, but which will not be used to build the model. These are not the actual use data, for which we ultimately would like to predict the value of the target variable. Instead, creating holdout data is like creating a “lab test” of generalization performance.
ElvinOuyang
definition of holdout data
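A minimal holdout “lab test” with scikit-learn (the dataset, model, and 25% split are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
# Hold out a quarter of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))
print("holdout accuracy: ", model.score(X_test, y_test))   # generalization estimate
```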
26%
Then we estimate the generalization performance by comparing the predicted values with the hidden true values.
26%
Thus, when the holdout data are used in this manner, they often are called the “test set.”
26%
This is known as the base rate, and a classifier that always selects the majority class is called a base rate classifier. A corresponding baseline for a regression model is a simple model that always predicts the mean or median value of the target variable.
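scikit-learn ships this baseline as DummyClassifier; a quick sketch (the 85/15 class split is made up):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((100, 1))                # features are ignored by the baseline
y = np.array([1] * 85 + [0] * 15)     # base rate: 85% positive

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))           # 0.85: any real model should beat this
```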
26%
Overfitting in Tree Induction
26%
ElvinOuyang
Tree induction fitting graph
27%
One way mathematical functions can become more complex is by adding more variables (more attributes).
27%
This concept generalizes: as you increase the dimensionality, you can perfectly fit larger and larger sets of arbitrary points. And even if you cannot fit the dataset perfectly, you can fit it better and better with more dimensions — that is, with more attributes.
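A concrete demonstration (mine, not the book's): a linear model with d attributes plus an intercept has d + 1 parameters, so it can exactly fit d + 1 arbitrary target values for points in general position.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
X = rng.normal(size=(n, n - 1))        # n - 1 attributes (+ intercept = n parameters)
y = rng.normal(size=n)                 # completely arbitrary targets

A = np.column_stack([np.ones(n), X])   # square design matrix: n equations, n unknowns
w = np.linalg.solve(A, y)
print(np.allclose(A @ w, y))           # True: random targets fit perfectly
```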
28%
Cross-validation is a more sophisticated holdout training and testing procedure.
ElvinOuyang
Cross validation
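A minimal 10-fold cross-validation sketch in scikit-learn (dataset and model illustrative): one dataset, ten systematic train/test splits, ten performance estimates.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print(scores.mean(), scores.std())    # average performance and its variability
```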
28%
The fact that the training and deployment populations are different is a likely source of performance degradation.
28%
ElvinOuyang
cross validation illustration
29%
A plot of the generalization performance against the amount of training data is called a learning curve.
29%
A learning curve shows the generalization performance — the performance only on testing data, plotted against the amount of training data used. A fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model complexity.
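scikit-learn can compute the points of a learning curve directly; a sketch (the model and training sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

print(sizes)                      # amounts of training data used
print(test_scores.mean(axis=1))   # generalization performance at each amount
```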
29%
Tree induction commonly uses two techniques to avoid overfitting. These strategies are (i) to stop growing the tree before it gets too complex, and (ii) to grow the tree until it is too large, then “prune” it back, reducing its size (and thereby its complexity).
29%
The simplest method to limit tree size is to specify a minimum number of instances that must be present in a leaf.
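In scikit-learn's tree induction this limit is the min_samples_leaf parameter (the threshold of 20 below is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)
limited = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

# Requiring at least 20 instances per leaf yields a much smaller tree.
print(unrestricted.get_n_leaves(), limited.get_n_leaves())
```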
29%
So, for stopping tree growth, an alternative to setting a fixed size for the leaves is to conduct a hypothesis test at every leaf to determine whether the observed difference in (say) information gain could have been due to chance. If the hypothesis test concludes that it was likely not due to chance, then the split is accepted and the tree growing continues.
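A sketch of the idea using a chi-squared test, one common choice for such a hypothesis test (the counts are hypothetical): tabulate the class counts in the two candidate child nodes and accept the split only when such a difference is unlikely to arise by chance.

```python
from scipy.stats import chi2_contingency

observed = [[40, 10],   # left child:  [positives, negatives]
            [15, 35]]   # right child: [positives, negatives]

chi2, p_value, dof, expected = chi2_contingency(observed)
accept_split = p_value < 0.05   # likely not due to chance -> keep growing
print(p_value, accept_split)
```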
30%
The key is to realize that there was nothing special about the first training/test split we made. Let’s say we are saving the test set for a final assessment. We can take the training set and split it again into a training subset and a testing subset. Then we can build models on this training subset and pick the best model based on this testing subset. Let’s call the former the sub-training set and the latter the validation set for clarity. The validation set is separate from the final test set, on which we are never going to make any modeling decisions. This procedure is often called nested holdout testing.
ElvinOuyang
nested holdout testing: choose the best complexity on the validation set before the final run on the test set
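A sketch of the procedure (split sizes and candidate complexities are illustrative): the test set is carved off first and consulted exactly once, at the very end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Final test set: no modeling decision is ever based on it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the training set again into sub-training and validation sets.
X_sub, X_val, y_sub, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Choose the best complexity (here, tree depth) on the validation set only.
candidates = {d: DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_sub, y_sub)
              for d in (2, 4, 8, None)}
best_depth = max(candidates, key=lambda d: candidates[d].score(X_val, y_val))

# Retrain at the chosen complexity, then make the one final assessment.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print(best_depth, final.score(X_test, y_test))
```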
30%
Sequential forward selection (SFS) of features uses a nested holdout procedure to first pick the best individual feature, by looking at all models built using just one feature.
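A greedy SFS sketch with cross-validation as the nested evaluation (the loop is my own; scikit-learn also provides a SequentialFeatureSelector): at each round, add the one feature that most improves validated performance.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):                                  # grow the subset to 3 features
    best = max(remaining,
               key=lambda f: cross_val_score(
                   LogisticRegression(), X[:, selected + [f]], y, cv=5).mean())
    selected.append(best)
    remaining.remove(best)

print(selected)   # indices of the greedily chosen features
```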
30%
For equations, such as logistic regression, that unlike trees do not automatically select what attributes to include, complexity can be controlled by choosing a “right” set of attributes.
30%
Models will be better if they fit the data better, but they also will be better if they are simpler. This general methodology is called regularization, a term that is heard often in data science discussions.
ElvinOuyang
regularization
30%
The most commonly used penalty is the sum of the squares of the weights, sometimes called the “L2-norm” of w. The reason is technical, but basically functions can fit data better if they are allowed to have very large positive and negative weights. The sum of the squares of the weights gives a large penalty when weights have large absolute values.
ElvinOuyang
L2 Norm for model simplicity control
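A sketch of the regularized objective in the abstract (my own notation; lam plays the role of the penalty weight λ):

```python
import numpy as np

def l2_penalized_loss(w, data_loss, lam):
    # Fit the data, but pay a price for weight magnitude: the L2 penalty
    # sum(w_j^2) punishes large positive and negative weights alike.
    return data_loss + lam * np.sum(w ** 2)

w = np.array([0.5, -3.0, 10.0])
print(np.sum(w ** 2))   # 109.25: dominated by the single large weight
```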
30%
Since these coefficients are the multiplicative weights on the features, L1-regularization effectively performs an automatic form of feature selection.
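A scikit-learn sketch of the effect (the dataset and penalty strength C = 0.1 are illustrative): with an L1 penalty, many coefficients land exactly at zero, deselecting those features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
zeroed = np.sum(l1_model.coef_ == 0)
print(zeroed, "of", l1_model.coef_.size, "weights are exactly zero")
```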
31%
This cross-validation would essentially conduct automated experiments on subsets of the training data and find a good λ value. Then this λ would be used to learn a regularized model on all the training data. This has become the standard procedure for building numerical models that give a good balance between data fit and model complexity. This general approach to optimizing the parameter values of a data mining procedure is known as grid search.
ElvinOuyang
grid search
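A minimal GridSearchCV sketch over the regularization strength (the grid values are illustrative; note that scikit-learn's C is the inverse of λ, so larger C means less regularization):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Cross-validated experiments over candidate penalty strengths...
search = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)   # ...then the best setting is refit on all the training data
print(search.best_params_, search.best_score_)
```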
31%
A fitting graph has two curves showing the model performance on the training and testing data as a function of model complexity.
31%
A learning curve shows model performance on testing data plotted against the amount of training data used.
31%
A common experimental methodology called cross-validation specifies a systematic way of splitting up a single dataset such that it generates multiple performance measures.
31%
The general method for reining in model complexity to avoid overfitting is called model regularization. Techniques include tree pruning (cutting a classification tree back when it has become too large), feature selection, and employing explicit complexity penalties into the objective function used for modeling.
32%
Equation 6-1. General Euclidean distance
ElvinOuyang
Euclidean distance calculation
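For two instances A = (a1, a2, ..., an) and B = (b1, b2, ..., bn) described by n attributes, the general Euclidean distance is:

\[ d(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2} \]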
33%
If you’re interested in playing with the Scotch Whiskey dataset, Lapointe and Legendre have made their data and paper available at: http://adn.biol.umontreal.ca/~numericalecology/data/scotch.html .
ElvinOuyang
run a visualization report on the scotch whiskey dataset
33%
Given a new example whose target variable we want to predict, we scan through all the training examples and choose several that are the most similar to the new example. Then we predict the new example’s target value, based on the nearest neighbors’ (known) target values.
34%
Nearest neighbor algorithms are often referred to by the shorthand k-NN, where the k refers to the number of neighbors used, such as 3-NN.
ElvinOuyang
k-NN model
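A self-contained 3-NN classifier in numpy (the data and k are illustrative): measure the Euclidean distance to every training example, take the k nearest, and let them vote.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training example
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]             # indices of the k closest
    return np.bincount(y_train[nearest]).argmax()   # majority vote

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [0.2, 0.1]])
y_train = np.array([0, 0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9])))   # -> 1
```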