Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
16%
Many odors are completely characteristic of poisonous or edible mushrooms, so odor is a very informative attribute to check when considering mushroom edibility.[17] If you’re going to build a model to determine mushroom edibility using only a single feature, you should choose its odor.
Paweł Cisło
Mushroom odor
16%
If we select the single variable that gives the most information gain, we create a very simple segmentation. If we select multiple attributes each giving some information gain, it’s not clear how to put them together.
Paweł Cisło
Selecting single vs multiple attributes
16%
Since we are talking about classification, here each leaf contains a classification for its segment. Such a tree is called a classification tree or, more loosely, a decision tree.
Paweł Cisło
Classification/decision tree
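As a quick illustration (not from the book), a minimal scikit-learn sketch of a classification tree on made-up data; the feature names and labels are hypothetical.

```python
# Minimal sketch (toy data, hypothetical feature names): a classification tree
# whose leaves each hold a classification for their segment.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 3], [0, 5], [1, 4], [1, 2], [2, 5], [2, 1]]   # [odor_score, cap_size]
y = [1, 1, 0, 0, 0, 0]                                  # 1 = edible, 0 = poisonous

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["odor_score", "cap_size"]))
```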
16%
leaves of the regression tree contain numeric values.
17%
common form of instance space visualization is a scatterplot on some pair of features, used to compare one variable against another to detect correlations and relationships.
Paweł Cisło
Scatterplot
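A minimal matplotlib sketch of such a scatterplot, on made-up data:

```python
# Minimal sketch (made-up data): plotting one feature against another to eyeball
# correlations and relationships in the instance space.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, 100)
balance = 500 + 40 * age + rng.normal(0, 300, 100)   # loosely correlated toy features

plt.scatter(age, balance, alpha=0.6)
plt.xlabel("age")
plt.ylabel("balance")
plt.show()
```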
17%
if a leaf contains n positive instances and m negative instances, the probability of any new instance being positive may be estimated as n/(n+m). This is called a frequency-based estimate of class membership probability.
Paweł Cisło
Frequency-based estimate of class membership probability
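A minimal sketch of the frequency-based estimate, with made-up leaf counts:

```python
# Frequency-based estimate of class-membership probability for a leaf holding
# n positive and m negative training instances.
def frequency_estimate(n, m):
    return n / (n + m)

print(frequency_estimate(9, 1))   # 0.9
print(frequency_estimate(1, 0))   # 1.0 -- a single-instance leaf gives an extreme estimate
```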
18%
“smoothed” version of the frequency-based estimate, known as the Laplace correction, the purpose of which is to moderate the influence of leaves with only a few instances.
Paweł Cisło
Laplace correction - "smoothed" version of the frequency-based estimate
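A minimal sketch of the binary-class Laplace correction, assuming the usual form p = (n + 1) / (n + m + 2):

```python
# Laplace-corrected ("smoothed") estimate; small leaves are pulled toward 0.5.
def laplace_estimate(n, m):
    return (n + 1) / (n + m + 2)

print(laplace_estimate(1, 0))    # ~0.67 instead of the frequency-based 1.0
print(laplace_estimate(20, 0))   # ~0.95 -- with many instances the correction barely matters
```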
18%
The information gain of a feature depends on the set of instances against which it is evaluated, so the ranking of features for some internal node may not be the same as the global ranking.
Paweł Cisło
Dependence of information gain on the instance set
18%
basic measure of attribute information is called information gain, which is based on a purity measure called entropy; another is variance reduction.
Paweł Cisło
Basic measures of attribute information: information gain + variance reduction
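A minimal sketch (not from the book) of entropy and information gain for a binary split; the counts are made up, and for numeric targets variance reduction plays the analogous role:

```python
# Entropy of a set and information gain of a split, for binary classes.
import math

def entropy(p_pos):
    if p_pos in (0.0, 1.0):
        return 0.0
    return -(p_pos * math.log2(p_pos) + (1 - p_pos) * math.log2(1 - p_pos))

def information_gain(parent, children):
    # parent and children are (n_pos, n_neg) counts; gain = entropy of the parent
    # minus the size-weighted average entropy of the children.
    def ent(counts):
        n_pos, n_neg = counts
        return entropy(n_pos / (n_pos + n_neg))
    total = sum(p + n for p, n in children)
    weighted = sum((p + n) / total * ent((p, n)) for p, n in children)
    return ent(parent) - weighted

# Splitting a 10-positive / 10-negative parent into (9, 1) and (1, 9) children
print(information_gain((10, 10), [(9, 1), (1, 9)]))   # about 0.53 bits
```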
19%
The goal of data mining is to tune the parameters so that the model fits the data as well as possible. This general approach is called parameter learning or parametric modeling.
Paweł Cisło
Goal of data mining
19%
In certain fields of statistics and econometrics, the bare model with unspecified parameters is called “the model.”
Paweł Cisło
"The model" - bare model with no parameters
20%
This is called a linear discriminant because it discriminates between the classes, and the function of the decision boundary is a linear combination — a weighted sum — of the attributes.
Paweł Cisło
Linear discriminant
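A minimal sketch with made-up weights: the discriminant is a weighted sum of the attributes, and the sign of that sum decides which side of the boundary an instance falls on.

```python
# Linear discriminant: classify by the sign of a weighted sum of the attributes.
import numpy as np

w = np.array([1.0, -1.5])   # one weight per attribute (made up)
b = 0.5                     # intercept

def f(x):
    return np.dot(w, x) + b   # the linear combination; f(x) = 0 is the decision boundary

x = np.array([2.0, 1.0])
print(f(x), "+" if f(x) > 0 else "-")
```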
20%
Linear functions are one of the workhorses of data science;
20%
going to “fit” this parameterized model to a particular dataset — meaning specifically, to find a good set of weights on the features.
Paweł Cisło
fit - find good set of weights
20%
Our general procedure will be to define an objective function that represents our goal, and can be calculated for a particular set of weights and a particular set of data. We will then find the optimal value for the weights by maximizing or minimizing the objective function.
Paweł Cisło
Procedure of choosing weights
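A minimal sketch (not the book's procedure) of that idea: write the objective as a function of the weights and hand it to a general-purpose optimizer. The toy data and the sum-of-squared-errors objective here are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)   # toy data

def objective(w):
    # Sum of squared errors of the linear model X @ w on this dataset
    return np.sum((X @ w - y) ** 2)

result = minimize(objective, x0=np.zeros(2))   # find weights minimizing the objective
print(result.x)                                # close to [3, -2] on this toy data
```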
21%
support vector machines are linear discriminants.
Paweł Cisło
SVMs
21%
If the data are not linearly separable, the best fit is some balance between a fat margin and a low total error penalty.
Paweł Cisło
Not linearly separable data
21%
“loss” is used across data science as a general term for error penalty. A loss function determines how much penalty should be assigned to an instance based on the error in the model’s predicted value — in our present context, based on its distance from the separation boundary.
Paweł Cisło
Loss function
21%
Support vector machines use hinge loss, so called because the loss graph looks like a hinge. Hinge loss incurs no penalty for an example that is not on the wrong side of the margin.
Paweł Cisło
Hinge loss
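A minimal sketch of the usual hinge-loss formula, max(0, 1 - y·f(x)) for labels y in {-1, +1}; the exact form is an assumption here, not quoted from the book.

```python
def hinge_loss(y, fx):
    return max(0.0, 1.0 - y * fx)

print(hinge_loss(+1, 2.5))   # 0.0 -- beyond the margin on the correct side: no penalty
print(hinge_loss(+1, 0.3))   # 0.7 -- inside the margin: small penalty
print(hinge_loss(+1, -1.0))  # 2.0 -- wrong side: penalty grows with distance
```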
21%
Squared error specifies a loss proportional to the square of the distance from the boundary. Squared error loss usually is used for numeric value prediction (regression), rather than classification.
Paweł Cisło
Squared error
21%
Unfortunately, using squared error for classification also penalizes points far on the correct side of the decision boundary.
Paweł Cisło
Unfortunate use of squared error
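A minimal sketch using the common form (y - f(x))^2; note the penalty assigned to a point that is far on the correct side of the boundary, where hinge loss would be zero.

```python
def squared_loss(y, fx):
    return (y - fx) ** 2

print(squared_loss(+1, 5.0))   # 16.0 -- correctly classified, far from the boundary, yet heavily penalized
print(squared_loss(+1, 1.0))   # 0.0  -- only a prediction right at the target escapes penalty
```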
21%
Each different linear regression modeling procedure uses one particular choice (and the data scientist should think carefully about whether it is appropriate for the problem).
Paweł Cisło
Choosing different linear regression modeling procedures
21%
an intuitive notion of the fit of the model is: how far away are the estimated values from the true values on the training data?
Paweł Cisło
Intuitive notion of the fit of the model
21%
For a particular training dataset, we could compute this error for each individual data point and sum up the results. Then the model that fits the data best would be the model with the minimum sum of errors on the training data. And that is exactly what regression procedures do.
Paweł Cisło
Regression procedures
21%
The method that is most natural is to simply subtract one from the other (and take the absolute value). So if I predict 10 and the actual value is 12 or 8, I make an error of 2. This is called absolute error, and we could then minimize the sum of absolute errors or equivalently the mean of the absolute errors across the training data.
Paweł Cisło
Absolute error
21%
Standard linear regression procedures instead minimize the sum or mean of the squares of these errors — which gives the procedure its common name “least squares” regression.
Paweł Cisło
Standard linear regression procedures
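A minimal sketch (not from the book) contrasting the two objectives on toy data with one large outlier; the least-absolute-error fit is the more robust choice discussed a few highlights later.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=30)
y[0] += 25.0   # one large outlier

def sum_squared(params):
    a, b = params
    return np.sum((y - (a * x + b)) ** 2)

def sum_absolute(params):
    a, b = params
    return np.sum(np.abs(y - (a * x + b)))

print(minimize(sum_squared, x0=[1.0, 0.0], method="Nelder-Mead").x)    # pulled toward the outlier
print(minimize(sum_absolute, x0=[1.0, 0.0], method="Nelder-Mead").x)   # typically closer to slope 2, intercept 1
```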
22%
squared error is particularly convenient mathematically.
Paweł Cisło
Squared error = convenience
22%
analysts often claim to prefer squared error because it strongly penalizes very large errors.
Paweł Cisło
Squared error = strong penalisation of very large errors
22%
In systems that build and apply models totally automatically, the modeling needs to be much more robust than when doing a detailed regression analysis “by hand.”
Paweł Cisło
Automatic systems need more robust modeling
22%
A more robust modeling procedure (e.g., use absolute error instead of squared error as the objective function).
Paweł Cisło
More robust modeling procedure
22%
linear discriminant could be used to identify accounts or transactions as likely to have been defrauded.
Paweł Cisło
Linear discriminant
22%
linear function f(x) that we’ve examined throughout the chapter is used as a measure of the log-odds of the “event” of interest.
Paweł Cisło
Logistic regression
22%
For probability estimation, logistic regression uses the same linear model as do our linear discriminants for classification and linear regression for estimating numeric target values.
22%
The output of the logistic regression model is interpreted as the log-odds of class membership. These log-odds can be translated directly into the probability of class membership.
Paweł Cisło
Output of logistic regression
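A minimal sketch with made-up weights: the linear output is read as log-odds and translated into a probability via p = 1 / (1 + e^(-f(x))).

```python
import numpy as np

w, b = np.array([0.8, -0.4]), 0.1   # made-up model weights

def log_odds(x):
    return np.dot(w, x) + b                      # the same linear function f(x)

def probability(x):
    return 1.0 / (1.0 + np.exp(-log_odds(x)))    # log-odds translated to probability

x = np.array([2.0, 1.0])
print(log_odds(x), probability(x))   # about 1.3 and 0.79
```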
22%
It is estimating the log-odds or, more loosely, the probability of class membership (a numeric quantity) over a categorical class.
Paweł Cisło
Logistic regression
23%
This is a direct consequence of the fact that classification trees select a single attribute at a time whereas linear classifiers use a weighted combination of all attributes.
Paweł Cisło
Classification trees vs linear classifiers
24%
The two most common families of techniques that are based on fitting the parameters of complex, nonlinear functions are nonlinear support-vector machines and neural networks.
Paweł Cisło
Most popular techniques
24%
Support vector machines have a so-called “kernel function” that maps the original features to some other feature space.
Paweł Cisło
Kernel function in the SVMs
24%
“polynomial kernel,” which essentially means it would consider “higher-order” combinations of the original features (e.g., squared features, products of features).
Paweł Cisło
Polynomial kernel
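A minimal sketch making those higher-order combinations explicit for two features; in the expanded space, a plain linear discriminant can express a boundary (here a circle) that is nonlinear in the original features.

```python
import numpy as np

def degree2_map(x):
    x1, x2 = x
    # original features, squared features, and the product of the features
    return np.array([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

# A linear discriminant in the expanded space: w . phi(x) + b = x1^2 + x2^2 - 1,
# i.e., a circular boundary in the original feature space.
w, b = np.array([0.0, 0.0, 1.0, 1.0, 0.0]), -1.0
for x in [(0.2, 0.3), (2.0, 0.0)]:
    print(x, "+" if np.dot(w, degree2_map(x)) + b > 0 else "-")
```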
25%
The parameters now are the coefficients of all the models, taken together.
Paweł Cisło
Parameters = coefficients
25%
as we increase the amount of flexibility we have to fit the data, we increase the chance that we fit the data too well. The model can fit details of its particular training set rather than finding patterns or models that apply more generally.
Paweł Cisło
Increasing flexibility of fitting the data
25%
Finding chance occurrences in data that look like interesting patterns, but which do not generalize, is called overfitting the data.
Paweł Cisło
Overfitting
25%
Generalization is the property of a model or modeling process, whereby the model applies to data that were not used to build the model.
Paweł Cisło
Generalization
25%
Overfitting is the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.
Paweł Cisło
Overfitting
25%
pure memorization, the most extreme overfitting procedure possible.
Paweł Cisło
Pure memorization
26%
there is a fundamental trade-off between model complexity and the possibility of overfitting.
Paweł Cisło
Model complexity vs possibility of overfitting
26%
fitting graph shows the accuracy of a model as a function of complexity.
Paweł Cisło
Fitting graph
26%
concept that is fundamental to evaluation in data science: holdout data.
Paweł Cisło
Holdout data
26%
What we need to do is to “hold out” some data for which we know the value of the target variable, but which will not be used to build the model.
Paweł Cisło
Which data to "hold out"?
26%
As the models get too complex, they look very accurate on the training data, but in fact are overfitting — the training accuracy diverges from the holdout (generalization) accuracy.
Paweł Cisło
Models getting too complex
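A minimal sketch (not from the book) pulling the last few ideas together with scikit-learn: hold out data, sweep a complexity knob (tree depth), and watch training accuracy keep rising while holdout accuracy eventually stalls or drops, which is the fitting graph in miniature.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 3, 5, 10, 20):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth,
          round(tree.score(X_train, y_train), 3),   # training accuracy
          round(tree.score(X_hold, y_hold), 3))     # holdout (generalization) accuracy
```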