The Art of Statistics: How to Learn from Data
Read between February 19 - June 10, 2020
Classification Trees
A classification tree is perhaps the simplest form of algorithm, since it consists of a series of yes/no questions, the answer to each deciding the next question to be asked, until a conclusion is reached.
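A minimal sketch, in Python, of what such a sequence of yes/no questions can look like for a Titanic-style survival prediction; the particular questions and thresholds are invented for illustration and are not the tree the book builds.

```python
# A hand-written classification tree: each yes/no question decides which
# question (or conclusion) comes next. Questions and thresholds are
# illustrative only.

def predict_survival(passenger):
    """Return 'survived' or 'did not survive' for a dict of features."""
    if passenger["sex"] == "female":          # question 1
        return "survived"
    if passenger["age"] < 10:                 # question 2 (males only)
        if passenger["class"] in (1, 2):      # question 3
            return "survived"
        return "did not survive"
    return "did not survive"

print(predict_survival({"sex": "male", "age": 8, "class": 2}))   # survived
print(predict_survival({"sex": "male", "age": 40, "class": 3}))  # did not survive
```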
Assessing the Performance of an Algorithm
Such a table of predicted versus actual outcomes is known as the error matrix, or sometimes the confusion matrix.
the percentage of true survivors that are correctly predicted is known as the sensitivity of the algorithm, while the percentage of true non-survivors that are correctly predicted is known as the specificity.
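A short sketch of how an error matrix and these two rates can be computed from predicted and true labels; the data and function names are illustrative.

```python
# Error (confusion) matrix and the two derived rates, for 0/1 labels
# where 1 = survivor. The data below are made up for illustration.

def error_matrix(actual, predicted):
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

actual    = [1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1, 0, 0, 1]

tp, tn, fp, fn = error_matrix(actual, predicted)
sensitivity = tp / (tp + fn)   # proportion of true survivors correctly predicted
specificity = tn / (tn + fp)   # proportion of true non-survivors correctly predicted
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```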
The overall accuracy (the proportion of cases correctly classified) is a very crude measure of performance and takes no account of the confidence with which a prediction is made.
Algorithms that give a probability (or any number) rather than a simple classification are often compared using Receiver Operating Characteristic (ROC) curves,
By considering all possible thresholds for predicting a survivor, the possible values for the specificity and sensitivity form a curve.
Figure 6.4 shows the ROC curves for training and test sets. A completely useless algorithm that assigns numbers at random would have a diagonal ROC curve, whereas the best algorithms will have ROC curves that move towards the top-left corner.
The area under the ROC curve is one way of measuring how well an algorithm splits the survivors from the non-survivors, but it does not measure how good the probabilities are.
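A sketch of tracing a ROC curve by sweeping thresholds over predicted probabilities and approximating the area under it with the trapezoidal rule; the probabilities and outcomes are invented.

```python
# Sweep thresholds over predicted probabilities to trace the ROC curve,
# then approximate the area under it. Data are invented for illustration.

probs  = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]   # predicted P(survive)
actual = [1,   1,   0,   1,   0,    1,   0,   0,   0]      # 1 = survivor

def roc_points(probs, actual):
    points = []
    thresholds = sorted(set(probs)) + [1.1]   # one threshold above all probabilities
    for t in thresholds:
        pred = [1 if p >= t else 0 for p in probs]
        tp = sum(a and q for a, q in zip(actual, pred))
        fn = sum(a and not q for a, q in zip(actual, pred))
        fp = sum((not a) and q for a, q in zip(actual, pred))
        tn = sum((not a) and not q for a, q in zip(actual, pred))
        points.append((fp / (fp + tn), tp / (tp + fn)))   # (1 - specificity, sensitivity)
    return sorted(points)

def auc(points):
    # trapezoidal rule over the (false-positive rate, sensitivity) points
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

pts = roc_points(probs, actual)
print(f"AUC = {auc(pts):.2f}")   # 0.5 means useless, 1.0 means perfect separation
```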
How do we know how good ‘probability of precipitation’ forecasts are?
To produce a probabilistic forecast, the model has to be run many times starting at slightly adjusted initial conditions, which produces a list of different ‘possible futures’,
in some of which it rains and in some it doesn’t. Forecasters run an ‘ensemble’ of, say, fifty models, and if it rains in five of those possible futures in a particular place and time, they claim a ‘probability of precipitation’ of 10%.
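A tiny sketch of that ensemble calculation, with invented ensemble outcomes:

```python
# 'Probability of precipitation' from an ensemble of model runs:
# the fraction of possible futures in which it rains at that place and time.
# The ensemble outcomes below are invented for illustration.

ensemble_rained = [True] * 5 + [False] * 45   # rain in 5 of 50 possible futures

prob_precip = sum(ensemble_rained) / len(ensemble_rained)
print(f"probability of precipitation = {prob_precip:.0%}")   # 10%
```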
But how do we check how good these prob...
This highlight has been truncated due to consecutive passage length restrictions.
Calibration plots allow us to see how reliable the stated probabilities are, by collecting together, say, the events given a particular probability of occurrence, and calculating the proportion of such events that actually occurred.
Figure 6.5 shows the calibration plot for the simple classification tree applied to the test set.
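A sketch of the bookkeeping behind a calibration plot: group forecasts by their stated probability and compare with the proportion of events that actually occurred. Forecasts and outcomes are invented.

```python
# Calibration check: for each stated probability, compute the proportion
# of those events that actually occurred. Data are illustrative.

from collections import defaultdict

forecast_prob = [0.1, 0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.7, 0.9, 0.9]
occurred      = [0,   0,   1,   0,   1,   1,   0,   1,   1,   1]

groups = defaultdict(list)
for p, outcome in zip(forecast_prob, occurred):
    groups[p].append(outcome)

for p in sorted(groups):
    observed = sum(groups[p]) / len(groups[p])
    print(f"stated probability {p:.1f} -> observed proportion {observed:.2f}")
# A well-calibrated forecaster's points lie close to the diagonal
# 'observed = stated' line on the calibration plot.
```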
A Combined Measure of ‘Accuracy’ for Probabilities
While the ROC curve assesses how well the algorithm splits the groups, and the calibration plot checks whether the probabilities mean what they say, it would be best to find a simple composite measure that combines both aspects into a single number we could use to compare algorithms. Fortunately weather forecasters back in the 1950s worked out exactly how to do this.
The usual summary of the error over a number of days is the mean-squared-error (MSE)—this is the average of the squares of the errors, and is analogous to the least-squares criterion we saw used in regression analysis.
The trick for probabilities is to use the same mean-squared-error criterion as when predicting a quantity, but treating a future observation of ‘rain’ as taking on the value 1, and ‘no rain’ as being 0.
This mean-squared-error is known as the Brier score, after meteorologist Glenn Brier, who described the method in 1950.
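A sketch of the Brier score as the mean-squared-error of probability forecasts against 0/1 outcomes; the forecasts and outcomes are invented.

```python
# Brier score: mean-squared-error of probability forecasts, treating
# 'rain' as 1 and 'no rain' as 0. Forecasts are illustrative.

forecast = [0.1, 0.8, 0.7, 0.2, 0.9]   # forecast probabilities of rain
rained   = [0,   1,   1,   0,   0]     # what actually happened (1 = rain)

brier = sum((p - y) ** 2 for p, y in zip(forecast, rained)) / len(forecast)
print(f"Brier score = {brier:.3f}")    # 0 is perfect; lower is better
```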
Unfortunately the Brier score is not easy to interpret on its own, and so it is difficult to get a feeling of whether any forecaster is doing well or badly; it is therefore best to compare it with a reference score derived from historical climate records.
Table 6.2
Forecasters then create a ‘skill score’, which is the proportional reduction of the reference score: in our case, 0.61,* meaning our algorithm has made a 61% improvement on a naïve forecaster who uses only climate data.
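A sketch of that skill-score calculation; the two Brier values below are illustrative numbers chosen so the result matches the 61% quoted here, not the book's actual figures.

```python
# Skill score: the proportional reduction in Brier score relative to a
# naive reference forecast based only on historical climate records.
# Both Brier values are illustrative, picked to reproduce a 0.61 skill score.

brier_reference = 0.25     # reference forecaster using climate data alone
brier_algorithm = 0.0975   # our algorithm's Brier score

skill = 1 - brier_algorithm / brier_reference
print(f"skill score = {skill:.2f}")   # 0.61 means a 61% improvement on the reference
```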
Over-fitting
This is known as over-fitting, and is one of the most vital topics in algorithm construction. By making an algorithm too complex, we essentially start fitting the noise rather than the signal.
We over-fit when we go too far in adapting to local circumstances, in a worthy but misguided effort to be ‘unbiased’ and take into account all the available information.
Over-fitting therefore leads to less bias but at a cost of more uncertainty or variation in the estimates, which is why protection against over-fitting is sometimes known as the bias/variance trade-off.
Perhaps the most common protection is to use the simple but powerful idea of cross-validation.
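A minimal sketch of k-fold cross-validation with a deliberately trivial stand-in model; the fit and accuracy functions here are placeholders, not a real learning algorithm.

```python
# k-fold cross-validation: repeatedly hold out one 'fold' of the data,
# fit on the rest, and evaluate on the held-out fold.

def fit(train):
    # Toy 'model': predict the most common label seen in the training data.
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def accuracy(model, test):
    return sum(y == model for _, y in test) / len(test)

data = [({"x": i}, i % 2) for i in range(20)]   # invented (features, label) pairs
k = 5
fold_size = len(data) // k

scores = []
for fold in range(k):
    test  = data[fold * fold_size:(fold + 1) * fold_size]
    train = data[:fold * fold_size] + data[(fold + 1) * fold_size:]
    model = fit(train)
    scores.append(accuracy(model, test))

print(f"cross-validated accuracy = {sum(scores) / len(scores):.2f}")
# Comparing this with accuracy on the training data itself helps reveal
# over-fitting: a large gap suggests the model is fitting noise.
```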
Regression Models
More Complex Techniques
This reflects a general concern that algorithms that win Kaggle competitions tend to be very complex in order to achieve that tiny final margin needed to win. A major problem is that these algorithms tend to be inscrutable black boxes—they come up with a prediction, but it is almost impossible to work out what is going on inside. This has three negative aspects.
First, extreme complexity makes implementation and upgrading a great effort:
The second negative feature is that we do not know how the conclusion was arrived at, or what confidence we should have in it: we just have to take it or leave it.
Finally, if we do not know how an algorithm is producing its answer, we cannot investigate it for implicit but systematic biases against some members of the community—
once performance is ‘good enough’, it may be reasonable to trade off further small increases for the need to retain simplicity.
Challenges of Algorithms
Four main concerns can be identified.
Lack of robustness: Algorithms are derived from associations, and since they do not understand underlying processes, they can be overly sensitive to changes.
Not accounting for statistical variability: Automated rankings based on limited data will be unreliable.
Implicit bias: To repeat, algorithms are based on associations, which may mean they end up using features that we would normally think are irrelevant to the task in hand.
Lack of transparency: Some algorithms may be opaque due to their sheer complexity. But even simple regression-based algorithms become totally inscrutable if their structure is private, perhaps through being a proprietary commercial product.
Artificial Intelligence
Summary
• Algorithms built from data can be used for classification and prediction in technological applications.
• It is important to guard against over-fitting an algorithm to training data, essentially fitting to noise rather than signal.
• Algorithms can be evaluated by the classification accuracy, their ability to discriminate between groups, and their overall predictive accuracy.
• Complex algorithms may lack transparency, and it may be worth trading off some accuracy for comprehension.
• The use of algorithms and artificial intelligence presents many challenges, and insights into both ...
CHAPTER 7 How Sure Can We Be About What Is Going On? Estimates and Intervals
The sample size should affect your confidence in the estimate, and knowing exactly how much difference it makes is a basic necessity for proper statistical inference.
Now we come to a critical step. In order to work out how accurate these statistics might be, we need to think of how much our statistics might change if we (in our imagination) were to repeat the sampling process many times.
If we knew how much these estimates would vary, then it would help tell us how accurate our actual estimate was. But unfortunately we could only work out the precise variability in our estimates if we knew precisely the details of the population. And this is exactly what we do not know.
There are two ways to resolve this circularity. The first is to make some mathematical assumptions about the shape of the population distribution, and use sophisticated probability theory to work out the variability we would expect in our estimate, and hence how far away we might expect, say, the average of our sample to be from the mean of the population. This is the traditional method that is taught in statistics textbooks, and we shall see how this works in Chapter 9.
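A small simulation sketch of this imagined repetition: assume a particular population shape, draw many samples from it, and see how much the sample mean varies from sample to sample. The population values are invented for illustration.

```python
# Imagined repetition of the sampling process: draw many samples from an
# assumed population and look at the spread of the sample means.
import random
import statistics

random.seed(0)
population_mean, population_sd = 50, 10   # assumed population (illustrative)
sample_size, repetitions = 100, 1000

sample_means = []
for _ in range(repetitions):
    sample = [random.gauss(population_mean, population_sd) for _ in range(sample_size)]
    sample_means.append(statistics.mean(sample))

print(f"spread (sd) of sample means = {statistics.stdev(sample_means):.2f}")
# Probability theory predicts roughly population_sd / sqrt(sample_size) = 1.0 here,
# which is what the traditional textbook approach works out directly.
```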
However, there is an alternative approach, based on the plausible assumption that the population should...
This highlight has been truncated due to consecutive passage length restrictions.