Kindle Notes & Highlights
Read between August 30, 2016 and March 24, 2018
The penalty for a misclassified point is proportional to the distance from the margin boundary, so if possible the SVM will make only “small” errors. Technically, this error function is known as hinge loss.
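For reference, with true class y ∈ {−1, +1} and f(x) the signed output of the linear decision function, hinge loss is usually written as

$$\ell_{\text{hinge}}(y, f(\mathbf{x})) = \max\bigl(0,\; 1 - y \cdot f(\mathbf{x})\bigr),$$

so examples on the correct side of the margin (y · f(x) ≥ 1) incur no penalty at all, and the penalty grows linearly with distance beyond the margin boundary.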
An important thing to remember is that once we see linear regression simply as an instance of fitting a (linear) model to data, we see that we have to choose the objective function to optimize — and we should do so with the ultimate business application in mind.
One very useful notion of the likelihood of an event is the odds. The odds of an event are the ratio of the probability of the event occurring to the probability of the event not occurring.
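In symbols, the definition is

$$\text{odds}(E) = \frac{p(E)}{1 - p(E)}.$$

For example, an event with probability 0.75 has odds 0.75 / 0.25 = 3, i.e., odds of 3 to 1.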
For probability estimation, logistic regression uses the same linear model as do our linear discriminants for classification and linear regression for estimating numeric target values.
Equation 4-3. Log-odds linear function
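In standard notation, writing p₊(x) for the estimated probability that example x belongs to the class of interest, the log-odds linear function is

$$\log\!\left(\frac{p_{+}(\mathbf{x})}{1 - p_{+}(\mathbf{x})}\right) = f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots$$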
Equation 4-4. The logistic function
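Solving the log-odds equation above for p₊(x) gives the logistic function:

$$p_{+}(\mathbf{x}) = \frac{1}{1 + e^{-f(\mathbf{x})}}$$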
The model (set of weights) that gives the highest sum is the model that gives the highest “likelihood” to the data — the “maximum likelihood” model. The maximum likelihood model “on average” gives the highest probabilities to the positive examples and the lowest probabilities to the negative examples.
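In the usual formulation, the sum being maximized is the log-likelihood of the training labels under the model’s probability estimates,

$$g(\mathbf{w}) = \sum_{\mathbf{x}\,\in\,\text{positives}} \log p_{+}(\mathbf{x}) \;+\; \sum_{\mathbf{x}\,\in\,\text{negatives}} \log\bigl(1 - p_{+}(\mathbf{x})\bigr),$$

and maximum-likelihood training chooses the weights w that maximize it.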
The logistic regression procedure then tries to estimate the probabilities (the probability distribution over the instance space) with a linear-log-odds model, based on the observed data, i.e., the results of the draws from that distribution.
The two most common families of techniques that are based on fitting the parameters of complex, nonlinear functions are nonlinear support-vector machines and neural networks.
The model can fit details of its particular training set rather than finding patterns or models that apply more generally.
We have also introduced two criteria by which models can be evaluated: the predictive performance of a model and its intelligibility.
Finding chance occurrences in data that look like interesting patterns, but which do not generalize, is called overfitting the data.
Generalization is the property of a model or modeling process, whereby the model applies to data that were not used to build the model.
We may worry that the training data were not representative of the true population, but that is not the problem here. The data were representative, but the data mining did not create a model that generalized beyond the training data.
What we need to do is to “hold out” some data for which we know the value of the target variable, but which will not be used to build the model. These are not the actual use data, for which we ultimately would like to predict the value of the target variable. Instead, creating holdout data is like creating a “lab test” of generalization performance.
Then we estimate the generalization performance by comparing the predicted values with the hidden true values.
Thus, when the holdout data are used in this manner, they often are called the “test set.”
This is known as the base rate, and a classifier that always selects the majority class is called a base rate classifier. A corresponding baseline for a regression model is a simple model that always predicts the mean or median value of the target variable.
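A minimal sketch of both baselines, assuming scikit-learn and synthetic data purely for illustration (the book itself is not tied to any particular tool):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import train_test_split

# Synthetic, class-imbalanced data purely for illustration.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base rate classifier: always predicts the majority class of the training set.
baseline_clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(baseline_clf.score(X_test, y_test))  # accuracy is roughly the base rate (~0.8)

# Baseline regressor: always predicts the mean (or median) of the training target.
baseline_reg = DummyRegressor(strategy="mean")  # strategy="median" also available
```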
Overfitting in Tree Induction
One way mathematical functions can become more complex is by adding more variables (more attributes).
This concept generalizes: as you increase the dimensionality, you can perfectly fit larger and larger sets of arbitrary points. And even if you cannot fit the dataset perfectly, you can fit it better and better with more dimensions — that is, with more attributes.
The fact that the training and deployment populations are different is a likely source of performance degradation.
A plot of the generalization performance against the amount of training data is called a learning curve.
A learning curve shows the generalization performance — the performance only on testing data, plotted against the amount of training data used. A fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model complexity.
Tree induction commonly uses two techniques to avoid overfitting. These strategies are (i) to stop growing the tree before it gets too complex, and (ii) to grow the tree until it is too large, then “prune” it back, reducing its size (and thereby its complexity).
The simplest method to limit tree size is to specify a minimum number of instances that must be present in a leaf.
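A minimal sketch of this stopping criterion, assuming scikit-learn (the parameter name min_samples_leaf is scikit-learn's, not the book's):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Require every leaf to contain at least 20 training instances; candidate splits
# that would create smaller leaves are not made, which caps the tree's complexity.
tree = DecisionTreeClassifier(min_samples_leaf=20).fit(X, y)
print(tree.get_n_leaves())
```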
So, for stopping tree growth, an alternative to setting a fixed size for the leaves is to conduct a hypothesis test at every leaf to determine whether the observed difference in (say) information gain could have been due to chance. If the hypothesis test concludes that it was likely not due to chance, then the split is accepted and the tree growing continues.
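A sketch of the idea behind such a test, using a chi-squared test of independence on the class counts a candidate split would produce (the specific test and threshold are illustrative choices, not prescribed by the text):

```python
from scipy.stats import chi2_contingency

# Hypothetical (positive, negative) class counts in the two children of a candidate split.
left_counts = [30, 10]
right_counts = [12, 28]

# Could the difference in class distribution between the children be due to chance?
chi2, p_value, dof, expected = chi2_contingency([left_counts, right_counts])
accept_split = p_value < 0.05  # grow further only if the split looks non-random
print(round(p_value, 4), accept_split)
```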
The key is to realize that there was nothing special about the first training/test split we made. Let’s say we are saving the test set for a final assessment. We can take the training set and split it again into a training subset and a testing subset. Then we can build models on this training subset and pick the best model based on this testing subset. Let’s call the former the sub-training set and the latter the validation set for clarity. The validation set is separate from the final test set, on which we are never going to make any modeling decisions. This procedure is often called nested holdout testing.
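A minimal sketch of the nested split, assuming scikit-learn and synthetic data; the variable names are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First split: the test set is locked away for the final assessment only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Second, nested split: carve a validation set out of the training data; models are
# built on the sub-training set and compared on the validation set, never on the test set.
X_subtrain, X_valid, y_subtrain, y_valid = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)
```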
Sequential forward selection (SFS) of features uses a nested holdout procedure to first pick the best individual feature, by looking at all models built using just one feature.
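A sketch of the first round of SFS, continuing the nested-holdout sketch above (X_subtrain, y_subtrain, X_valid, y_valid are the hypothetical arrays defined there); logistic regression is just one possible base model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Score every single-feature model on the validation set and keep the best one.
scores = []
for j in range(X_subtrain.shape[1]):
    model = LogisticRegression().fit(X_subtrain[:, [j]], y_subtrain)
    scores.append(model.score(X_valid[:, [j]], y_valid))
best_first_feature = int(np.argmax(scores))

# Later rounds add, one at a time, whichever remaining feature most improves
# validation performance when combined with the features already selected.
print(best_first_feature)
```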
For equation-based models such as logistic regression, which (unlike trees) do not automatically select which attributes to include, complexity can be controlled by choosing a “right” set of attributes.
The most commonly used penalty is the sum of the squares of the weights, sometimes called the squared “L2-norm” of w. The reason is technical, but basically functions can fit data better if they are allowed to have very large positive and negative weights, and the sum of the squares of the weights imposes a large penalty when weights have large absolute values.
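In the usual formulation, the regularized objective trades off fit against this penalty, with λ controlling how heavily large weights are punished:

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \Bigl[\, \text{fit error}(\mathbf{w}; \text{data}) \;+\; \lambda \sum_j w_j^2 \,\Bigr]$$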
Since these coefficients are the multiplicative weights on the features, L1-regularization effectively performs an automatic form of feature selection.
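A minimal sketch showing this effect with scikit-learn's L1-penalized logistic regression on synthetic data (C is scikit-learn's inverse of the regularization strength λ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# 'liblinear' is one of the solvers that supports the L1 penalty.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print((model.coef_ == 0).sum(), "of", model.coef_.size, "coefficients driven to exactly zero")
```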
This cross-validation would essentially conduct automated experiments on subsets of the training data and find a good λ value. Then this λ would be used to learn a regularized model on all the training data. This has become the standard procedure for building numerical models that give a good balance between data fit and model complexity. This general approach to optimizing the parameter values of a data mining procedure is known as grid search.
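A minimal sketch of that procedure, assuming scikit-learn, ridge regression as the numerical model, and a synthetic dataset (alpha is scikit-learn's name for λ):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=30, noise=10.0, random_state=0)

# Try each candidate lambda with 5-fold cross-validation on the training data,
# then refit on all of it with the best value found (GridSearchCV refits by default).
grid = GridSearchCV(Ridge(), param_grid={"alpha": np.logspace(-3, 3, 13)}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```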
A fitting graph has two curves showing the model performance on the training and testing data as a function of model complexity.
A learning curve shows model performance on testing data plotted against the amount of training data used.
A common experimental methodology called cross-validation specifies a systematic way of splitting up a single dataset such that it generates multiple performance measures.
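A minimal sketch, assuming scikit-learn and one of its bundled datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 10-fold cross-validation: each fold serves once as the test set,
# yielding ten separate performance estimates from a single dataset.
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=10)
print(scores.mean(), scores.std())
```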
The general method for reining in model complexity to avoid overfitting is called model regularization. Techniques include tree pruning (cutting a classification tree back when it has become too large), feature selection, and incorporating explicit complexity penalties into the objective function used for modeling.
Given a new example whose target variable we want to predict, we scan through all the training examples and choose several that are the most similar to the new example. Then we predict the new example’s target value based on the nearest neighbors’ (known) target values.
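A minimal sketch of this nearest-neighbor prediction, assuming scikit-learn (k = 3 and Euclidean distance are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# For each new example, find the 3 most similar training examples (by Euclidean
# distance) and predict the majority class among those neighbors.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict(X_test[:5]))
```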