Kindle Notes & Highlights
Read between February 19 - June 10, 2020
Classification Trees
A classification tree is perhaps the simplest form of algorithm, since it consists of a series of yes/no questions, the answer to each deciding the next question to be asked, until a conclusion is reached.
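As a rough illustration, here is a minimal sketch of such a tree written out as code; the split rules (sex, then age) are invented for a Titanic-style survival question and are not the tree actually fitted in the book.

```python
# A minimal sketch of a classification tree as a chain of yes/no questions.
# The split rules below are illustrative only, not the book's fitted tree.

def predict_survival(passenger):
    """Return a predicted class by asking a series of yes/no questions."""
    if passenger["sex"] == "female":          # first question
        return "survived"
    if passenger["age"] < 10:                 # second question, asked only for males
        return "survived"
    return "did not survive"                  # conclusion reached

print(predict_survival({"sex": "male", "age": 8}))   # -> survived
print(predict_survival({"sex": "male", "age": 40}))  # -> did not survive
```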
Assessing the Performance of an Algorithm
The table of correct and incorrect predictions is known as the error matrix, or sometimes the confusion matrix.
The percentage of true survivors that are correctly predicted is known as the sensitivity of the algorithm, while the percentage of true non-survivors that are correctly predicted is known as the specificity.
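A minimal sketch of how sensitivity and specificity fall out of such a matrix, using invented predictions (NumPy assumed):

```python
import numpy as np

# A minimal sketch: sensitivity and specificity from an error (confusion) matrix.
# The outcomes and predictions below are made up for illustration.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # 1 = true survivor, 0 = true non-survivor
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])   # the algorithm's yes/no predictions

tp = np.sum((y_true == 1) & (y_pred == 1))    # true survivors predicted to survive
fn = np.sum((y_true == 1) & (y_pred == 0))
tn = np.sum((y_true == 0) & (y_pred == 0))    # true non-survivors predicted not to survive
fp = np.sum((y_true == 0) & (y_pred == 1))

sensitivity = tp / (tp + fn)   # proportion of true survivors correctly predicted
specificity = tn / (tn + fp)   # proportion of true non-survivors correctly predicted
print(sensitivity, specificity)
```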
Classification accuracy alone is a very crude measure of performance and takes no account of the confidence with which a prediction is made.
Algorithms that give a probability (or any number) rather than a simple classification are often compared using Receiver Operating Characteristic (ROC) curves.
By considering all possible thresholds for predicting a survivor, the possible values for the specificity and sensitivity form a curve.
Figure 6.4 shows the ROC curves for training and test sets. A completely useless algorithm that assigns numbers at random would have a diagonal ROC curve, whereas the best algorithms will have ROC curves that move towards the top-left corner.
The area under the ROC curve is one way of measuring how well an algorithm splits the survivors from the non-survivors, but it does not measure how good the probabilities are.
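A minimal sketch of sweeping every possible threshold to trace an ROC curve and of computing the area under it, using invented scores and assuming scikit-learn's roc_curve and roc_auc_score are available:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# A minimal sketch of an ROC curve: sweep every possible threshold on the
# algorithm's output and record the resulting sensitivity/specificity pair.
# The scores below are made up for illustration.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.7, 0.6, 0.3, 0.5, 0.4, 0.2, 0.1])  # predicted probability of survival

fpr, tpr, thresholds = roc_curve(y_true, scores)   # tpr = sensitivity, fpr = 1 - specificity
auc = roc_auc_score(y_true, scores)                # area under the ROC curve
print(f"AUC = {auc:.2f}")   # 0.5 for a useless random algorithm, 1.0 for perfect separation
```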
How do we know how good ‘probability of precipitation’ forecasts are?
To produce a probabilistic forecast, the model has to be run many times starting at slightly adjusted initial conditions, which produces a list of different 'possible futures', in some of which it rains and in some it doesn't. Forecasters run an 'ensemble' of, say, fifty models, and if it rains in five of those possible futures in a particular place and time, they claim a 'probability of precipitation' of 10%.
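A minimal sketch of that arithmetic, with an invented ensemble of fifty runs:

```python
# A minimal sketch of turning an ensemble of model runs into a
# 'probability of precipitation'; the ensemble outcomes are invented.
ensemble_rain = [True] * 5 + [False] * 45     # rain in 5 of 50 possible futures
prob_precipitation = sum(ensemble_rain) / len(ensemble_rain)
print(f"{prob_precipitation:.0%}")            # -> 10%
```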
But how do we check how good these probabilities are?
Calibration plots allow us to see how reliable the stated probabilities are, by collecting together, say, the events given a particular probability of occurrence, and calculating the proportion of such events that actually occurred.
Figure 6.5 shows the calibration plot for the simple classification tree applied to the test set.
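A minimal sketch of the idea behind a calibration plot: group forecasts by their stated probability and compare with the proportion of events that actually occurred (all numbers invented):

```python
import numpy as np

# A minimal sketch of a calibration check: group predictions by their stated
# probability and compare with the proportion of events that actually occurred.
# The probabilities and outcomes below are made up for illustration.
probs    = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5, 0.9, 0.9, 0.9])
happened = np.array([0,   0,   1,   0,   1,   1,   0,   1,   1,   1])

for p in np.unique(probs):
    observed = happened[probs == p].mean()
    print(f"stated probability {p:.1f} -> observed proportion {observed:.2f}")
```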
A Combined Measure of ‘Accuracy’ for Probabilities
While the ROC curve assesses how well the algorithm splits the groups, and the calibration plot checks whether the probabilities mean what they say, it would be best to find a simple composite measure that combines both aspects into a single number we could use to compare algorithms. Fortunately weather forecasters back in the 1950s worked out exactly how to do this.
The usual summary of the error over a number of days is the mean-squared-error (MSE)—this is the average of the squares of the errors, and is analogous to the least-squares criterion we saw used in regression analysis.
The trick for probabilities is to use the same mean-squared-error criterion as when predicting a quantity, but treating a future observation of 'rain' as taking on the value 1, and 'no rain' as taking on the value 0.
This mean-squared-error is known as the Brier score, after meteorologist Glenn Brier, who described the method in 1950.
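A minimal sketch of the Brier score under that 1/0 coding, with invented forecasts and outcomes:

```python
import numpy as np

# A minimal sketch of the Brier score: the mean-squared-error of probability
# forecasts, with 'rain' coded as 1 and 'no rain' coded as 0.
# The forecasts and outcomes are invented for illustration.
forecasts = np.array([0.1, 0.8, 0.7, 0.2, 0.9])   # forecast probabilities of rain
outcomes  = np.array([0,   1,   0,   0,   1  ])   # what actually happened

brier = np.mean((forecasts - outcomes) ** 2)
print(f"Brier score = {brier:.3f}")                # lower is better; 0 is perfect
```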
Unfortunately the Brier score is not easy to interpret on its own, and so it is difficult to get a feeling of whether any forecaster is doing well or badly; it is therefore best to compare it with a reference score derived from historical climate records.
Table 6.2
Forecasters then create a 'skill score', which is the proportional reduction of the reference score: in our case, 0.61, meaning our algorithm has made a 61% improvement on a naïve forecaster who uses only climate data.
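A minimal sketch of a skill score computed against a naïve climate-based reference; the numbers are invented, so the resulting score will not be the 0.61 quoted above:

```python
import numpy as np

# A minimal sketch of a skill score: the proportional reduction in Brier score
# relative to a naive reference forecaster who always quotes the historical
# (climate) probability. All numbers are invented for illustration.
outcomes  = np.array([0, 1, 0, 0, 1, 1, 0, 1])
forecasts = np.array([0.2, 0.9, 0.1, 0.3, 0.8, 0.7, 0.2, 0.9])
climate_p = outcomes.mean()                        # the reference: long-run frequency

brier     = np.mean((forecasts - outcomes) ** 2)
brier_ref = np.mean((climate_p - outcomes) ** 2)
skill     = 1 - brier / brier_ref                  # e.g. 0.61 would mean a 61% improvement
print(f"skill score = {skill:.2f}")
```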
Over-fitting
This is known as over-fitting, and is one of the most vital topics in algorithm construction. By making an algorithm too complex, we essentially start fitting the noise rather than the signal.
We over-fit when we go too far in adapting to local circumstances, in a worthy but misguided effort to be ‘unbiased’ and take into account all the available information.
Over-fitting therefore leads to less bias but at a cost of more uncertainty or variation in the estimates, which is why protection against over-fitting is sometimes known as the bias/variance trade-off.
Perhaps the most common protection is to use the simple but powerful idea of cross-validation.
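A minimal sketch of k-fold cross-validation as a guard against over-fitting, on synthetic data and assuming scikit-learn is available; a shallow tree and a very deep one are compared on folds they never saw:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# A minimal sketch of cross-validation: the data are held out in turn, so a
# tree that has merely fitted the noise will typically score worse on the
# folds it never saw. The data here are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)   # signal plus noise

for depth in (2, 20):   # a simple tree versus a very deep (over-fitted) one
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)               # 5-fold cross-validation
    print(f"max_depth={depth}: mean held-out accuracy = {scores.mean():.2f}")
```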
Regression Models
More Complex Techniques
This reflects a general concern that algorithms that win Kaggle competitions tend to be very complex in order to achieve that tiny final margin needed to win. A major problem is that these algorithms tend to be inscrutable black boxes—they come up with a prediction, but it is almost impossible to work out what is going on inside. This has three negative aspects.
First, extreme complexity makes implementation and upgrading a great effort.
The second negative feature is that we do not know how the conclusion was arrived at, or what confidence we should have in it: we just have to take it or leave it.
Finally, if we do not know how an algorithm is producing its answer, we cannot investigate it for implicit but systematic biases against some members of the community.
Once performance is 'good enough', it may be reasonable to trade off further small gains in accuracy against the need to retain simplicity.
Challenges of Algorithms
Four main concerns can be identified.
Lack of robustness: Algorithms are derived from associations, and since they do not understand underlying processes, they can be overly sensitive to changes.
Not accounting for statistical variability: Automated rankings based on limited data will be unreliable.
Implicit bias: To repeat, algorithms are based on associations, which may mean they end up using features that we would normally think are irrelevant to the task in hand.
Lack of transparency: Some algorithms may be opaque due to their sheer complexity. But even simple regression-based algorithms become totally inscrutable if their structure is private, perhaps through being a proprietary commercial product.
Artificial Intelligence
Summary
• Algorithms built from data can be used for classification and prediction in technological applications.
• It is important to guard against over-fitting an algorithm to training data, essentially fitting to noise rather than signal.
• Algorithms can be evaluated by the classification accuracy, their ability to discriminate between groups, and their overall predictive accuracy.
• Complex algorithms may lack transparency, and it may be worth trading off some accuracy for comprehension.
• The use of algorithms and artificial intelligence presents many challenges, and insights into both …
CHAPTER 7 How Sure Can We Be About What Is Going On? Estimates and Intervals
The sample size should affect your confidence in the estimate, and knowing exactly how much difference it makes is a basic necessity for proper statistical inference.
Now we come to a critical step. In order to work out how accurate these statistics might be, we need to think of how much our statistics might change if we (in our imagination) were to repeat the sampling process many times.
If we knew how much these estimates would vary, then it would help tell us how accurate our actual estimate was. But unfortunately we could only work out the precise variability in our estimates if we knew precisely the details of the population. And this is exactly what we do not know.
There are two ways to resolve this circularity. The first is to make some mathematical assumptions about the shape of the population distribution, and use sophisticated probability theory to work out the variability we would expect in our estimate, and hence how far away we might expect, say, the average of our sample to be from the mean of the population. This is the traditional method that is taught in statistics textbooks, and we shall see how this works in Chapter 9.
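A minimal sketch of that imagined repetition, using a simulated population so the 'many repeated samples' can actually be drawn and the variability of the sample mean observed directly:

```python
import numpy as np

# A minimal sketch of the thought-experiment in the text: if we could repeat
# the sampling many times from a known population, how much would the sample
# mean vary? The 'population' here is simulated so the experiment can be run.
rng = np.random.default_rng(1)
population = rng.gamma(shape=2.0, scale=10.0, size=100_000)   # an assumed, skewed population

sample_means = [rng.choice(population, size=100, replace=False).mean()
                for _ in range(2_000)]                        # 2,000 imagined repeats

print(f"population mean     = {population.mean():.2f}")
print(f"spread of the means = {np.std(sample_means):.2f}")    # how accurate one sample tends to be
```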
However, there is an alternative approach, based on the plausible assumption that the population should...

