Kindle Notes & Highlights
Given a t-value and the sample size, the software can provide a precise P-value; for large samples, t-values greater than 2 or less than −2 correspond to P < 0.05, although these thresholds are somewhat larger for small samples.
In Chapter 6 we saw that an algorithm might win a prediction competition by a very small margin. When predicting the survival of the Titanic test set, for example, the simple classification tree achieved the best Brier score (average mean squared prediction error) of 0.139, only slightly lower than the score of 0.142 from the averaged neural network (see Table 6.4). It is reasonable to ask whether this small winning margin of −0.003 is statistically significant, in the sense of whether or not it could be explained by chance variation.
If we do two trials, and look at the most extreme, the chance of getting at least one significant – and hence false-positive – result is close to 0.10 or 10%.fn5 The chance of getting at least one false-positive result increases quickly as we do more trials; if we do ten trials of useless drugs the chance of getting at least one significant at P < 0.05 gets as high as 40%. This is known as the problem of multiple testing, and occurs whenever many significance tests are carried out and then the most significant result is reported.
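These figures can be checked directly; a minimal sketch, assuming each test is independent and carried out at the 0.05 level:

```python
# Chance of at least one false-positive result among n independent tests,
# each at significance level 0.05, when every null hypothesis is true.
def chance_of_any_false_positive(n_tests, alpha=0.05):
    return 1 - (1 - alpha) ** n_tests

print(round(chance_of_any_false_positive(2), 3))    # 0.098 - close to the 10% quoted above
print(round(chance_of_any_false_positive(10), 3))   # 0.401 - the 'as high as 40%' for ten trials
```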
The twist was that the ‘subject’ was a 4lb Atlantic salmon, which ‘was not alive at the time of scanning’. Out of a total of 8,064 sites in the brain of this large dead fish, 16 showed a statistically significant response to the photographs. Rather than concluding the dead salmon had miraculous skills, the team correctly identified the problem of multiple testing – over 8,000 significance tests are bound to lead to false-positive results.
One way around this problem is to demand a very low P-value at which significance is declared, and the simplest method, known as the Bonferroni correction, is to use a threshold of 0.05/n, where n is the number of tests done.
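A minimal sketch of the Bonferroni idea, using the dead-salmon study's 8,064 tests as the example:

```python
# Bonferroni correction: declare significance only when p < alpha / n.
def bonferroni_threshold(alpha, n_tests):
    return alpha / n_tests

# For the 8,064 brain sites in the dead-salmon study:
print(bonferroni_threshold(0.05, 8064))   # about 0.0000062
# The chance of any false positive across all the tests is then held at roughly 0.05.
```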
Another way to avoid false-positives is to demand replication of the original study, with the repeat experiment carried out in entirely different circumstances, but with essentially the same protocol.
If a small P-value is observed, then either something very surprising has happened, or the null hypothesis is untrue: the smaller the P-value, the more evidence that the null hypothesis might be an inappropriate assumption. This was intended as a fairly informal procedure, but in the 1930s Neyman and Pearson developed a theory of inductive behaviour which attempted to put hypothesis testing on a more rigorous mathematical footing.
They then considered the possible decisions after a hypothesis test, which are either to reject a null hypothesis in favour of the alternative, or not to reject the null.fn6 Two types of mistake are therefore possible: a Type I error is made when we reject a null hypothesis when it is true, and a Type II error is made when we do not reject a null hypothesis when in fact the alternative hypothesis holds. There is a strong legal analogy which is illustrated in Table 10.6 – a Type I legal error is to falsely convict an innocent person, and a Type II error is to find someone ‘not guilty’ when in fact they are guilty.
When planning an experiment, Neyman and Pearson suggested that we should choose two quantities which together will determine how large the experiment should be. First, we should fix the probability of a Type I error, given the null is true, at a pre-specified value, say 0.05; this is known as the size of a test, and generally denoted α (alpha). Second, we should pre-specify the probability of a Type II error, given the alternative hypothesis is true, generally known as β (beta). In fact researchers generally work in terms of 1 – β, which is termed the power of a test, and is the chance of correctly detecting the alternative hypothesis when it is true.
Formulae exist for the size and power of different forms of experiment, and they each depend crucially on sample size.
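The text does not give these formulae, but a rough sketch for one common case, a two-sided comparison of two group means using the normal approximation, shows how the size, the power and a plausible effect size (in standard-deviation units, an assumed input here) together determine the sample size:

```python
from scipy.stats import norm

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided two-sample z-test that
    detects a difference of `effect_size` standard deviations."""
    z_alpha = norm.ppf(1 - alpha / 2)   # controls the size (Type I error)
    z_beta = norm.ppf(power)            # controls the power (1 - Type II error)
    return 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2

print(round(sample_size_per_group(0.5)))   # roughly 63 per group for a 'medium' effect
```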
The Neyman–Pearson theory had its roots in industrial quality control, but is now used extensively in testing new medical treatments.
The idea of having a large enough sample to have sufficient power to detect a plausible alternative hypothesis has become totally entrenched in planning medical studies. But studies in psychology and neuroscience often have sample sizes chosen on the basis of convenience or tradition, and can be as low as 20 subjects per condition being studied.
There is some remarkable but complex theory, known by the delightful term ‘the Law of the Iterated Logarithm’, that shows that if we carry out such repeated testing, even if the null hypothesis is true, then we are certain to eventually reject that null at any significance level we choose.
Statisticians in the US and UK, working independently, developed what became known as the Sequential Probability Ratio Test (SPRT), which is a statistic that monitors accumulating evidence about deviations, and can at any time be compared with simple thresholds – as soon as one of these thresholds is crossed, then an alert is triggered and the production line is investigated.
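A minimal sketch of a sequential test in this spirit (Wald's SPRT with its approximate thresholds), not the exact procedure used in either the industrial or the Shipman analyses:

```python
import math

def sprt(observations, log_likelihood_ratio, alpha=0.05, beta=0.10):
    """Accumulate evidence one observation at a time and stop as soon as
    either of Wald's approximate thresholds is crossed."""
    upper = math.log((1 - beta) / alpha)    # cross this: raise an alert
    lower = math.log(beta / (1 - alpha))    # cross this: no evidence of a problem
    total = 0.0
    for i, x in enumerate(observations, start=1):
        total += log_likelihood_ratio(x)
        if total >= upper:
            return 'alert', i
        if total <= lower:
            return 'no concern', i
    return 'keep monitoring', len(observations)

# Hypothetical example: do defects occur at rate 0.5 (null) or 0.7 (alternative)?
llr = lambda x: math.log(0.7 / 0.5) if x else math.log(0.3 / 0.5)
print(sprt([True] * 12, llr))   # ('alert', 9): the upper threshold is crossed at the ninth defect
```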
A monitoring system for general practitioners was subsequently piloted, which immediately identified a GP with even higher mortality rates than Shipman! Investigation revealed this doctor practised in a south-coast town with a large number of retirement homes with many old people, and he conscientiously helped many of his patients to remain out of hospital for their death. It would have been completely inappropriate for this GP to receive any publicity for his apparently high rate of signing death certificates.
But in the last few decades P-values have become the currency of research, with vast numbers appearing in the scientific literature – a study scraped around 30,000 t-statistics and their accompanying P-values from just three years of papers in eighteen psychology and neuroscience journals.
In the real world of research, although experiments are carried out in the hope of making a discovery, it is recognized that most null hypotheses are (at least approximately) true.
The expected frequencies of the outcomes of 1,000 hypothesis tests carried out with size 5% (Type I error, α) and 80% power (1 − Type II error, 1−β). Only 10% (100) of the null hypotheses are false, and we correctly detect 80% of them (80). Of the 900 null hypotheses that are true, we incorrectly reject 45 (5%). Overall, of 125 ‘discoveries’, 36% (45) are false discoveries.
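The arithmetic behind this table, as a minimal sketch:

```python
# 1,000 hypothesis tests: 10% of nulls are false; size 5%; power 80%.
n_tests, proportion_false_nulls, alpha, power = 1000, 0.10, 0.05, 0.80

false_nulls = n_tests * proportion_false_nulls             # 100 real effects
true_nulls = n_tests - false_nulls                         # 900 with no real effect
true_discoveries = power * false_nulls                     # 80 correctly detected
false_discoveries = alpha * true_nulls                     # 45 false alarms
total_discoveries = true_discoveries + false_discoveries   # 125 'discoveries' in all

print(false_discoveries / total_discoveries)               # 0.36: 36% are false discoveries
```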
Similar problems can occur with P-values, which measure the chance of such extreme data occurring, if the null hypothesis is true, and do not measure the chance that the null hypothesis is true, given that such extreme data have occurred. This is a subtle but essential difference.
When the CERN teams reported a ‘five-sigma’ result for the Higgs boson, corresponding to a P-value of around 1 in 3.5 million, the BBC reported the conclusion correctly, saying this meant ‘about a one-in-3.5 million chance that the signal they see would appear if there were no Higgs particle.’ But nearly every other outlet got the meaning of this P-value wrong. For example, Forbes Magazine reported, ‘The chances are less than 1 in a million that it is not the Higgs boson’, a clear example of the prosecutor’s fallacy.
The ASA’s third principle seeks to counter the obsession with statistical significance: Scientific conclusions and business or policy decisions should not be based only on whether a P-value passes a specific threshold.
A dire consequence of this simple dichotomy is the misinterpretation of ‘not significant’. A non-significant P-value suggests the data are compatible with the null hypothesis, but this does not mean the null hypothesis is precisely true. After all, just because there’s no direct evidence that a criminal was at the scene of a crime, that does not mean he is innocent. But this mistake is surprisingly common.
A P-value, or statistical significance, does not measure the size of an effect or the importance of a result.
The main lessons from this scientific study might therefore be (a) that ‘big data’ can easily lead to findings that are statistically significant but not of practical significance, and (b) that you should not be concerned that studying for your degree is going to give you a brain tumour.
By itself, a P-value does not provide a good measure of evidence regarding a model or hypothesis. For example, a P-value near 0.05 taken by itself offers only weak evidence against the null hypothesis.
We shall see some of the poor consequences of this behaviour in Chapter 12, but first turn to an alternative approach to statistical inference that entirely rejects the whole idea of null-hypothesis significance testing.
I must now make an admission on behalf of the statistical community. The formal basis for learning from data is a bit of a mess. Although there have been numerous attempts to produce a single unifying theory of statistical inference, none has been fully accepted. It is no wonder mathematicians tend to dislike teaching statistics.
The good news is that the Bayesian approach opens fine new possibilities for making the most of complex data.
Thomas Bayes’ first great contribution was to use probability as an expression of our lack of knowledge about the world or, equivalently, our ignorance about what is currently going on. He showed that probability can be used not only for future events subject to random chance – aleatory uncertainty, to use the term introduced in Chapter 8 – but also for events which are true, and might well be known to some people, but that we are not privy to – so-called epistemic uncertainty.
If you briefly think about it, we are surrounded by epistemic uncertainty about things that are fixed but unknown to us. Gamblers bet on the next card to be dealt, we buy lottery scratch-cards, we discuss the possible sex of a baby, we puzzle over whodunnits, we argue over the numbers of tigers left in the wild, and we are told estimates of the possible number of migrants or the unemployed.
To emphasise again, from a Bayesian perspective, it is fine to use probabilities to represent our personal ignorance about these facts and numbers.
So these Bayesian probabilities are necessarily subjective – they depend on our relationship with the outside world, and are not properties of the world itself. These probabilities should change as we receive new information.
Which brings us to Bayes’ second key contribution: a result in probability theory that allows us to continuously revise our current probabilities in the light of new evidence.
Bayes’ legacy is the fundamental insight that the data does not speak for itself – our external knowledge, and even our judgement, has a central role.
You have three coins in your pocket: one has two heads, one is fair and one has two tails. You pick a coin at random and flip it, and it comes up heads. What should be your probability that the other side of the coin also shows heads?
There are many ways to check whether this is correct, but the easiest is to use the idea of expected frequencies demonstrated in Chapter 8.
Three of the flips end up showing a head, and in two of these the coin is two-headed. So your probability that the chosen coin is two-headed rather than fair should be 2⁄3, and not ½. Essentially, seeing a head makes it more likely that the two-headed coin has been chosen, since this coin provides two opportunities for a head to land face-up, whereas the fair coin only provides one.
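The expected-frequency argument can be checked by listing the six equally likely (coin, face) outcomes; a minimal sketch:

```python
from fractions import Fraction

# Three coins, each with two faces: six equally likely (coin, visible face) outcomes.
coins = {'two-headed': ['H', 'H'], 'fair': ['H', 'T'], 'two-tailed': ['T', 'T']}
outcomes = [(name, face) for name, faces in coins.items() for face in faces]

# Condition on seeing a head, then ask how often the chosen coin is two-headed.
coins_showing_heads = [name for name, face in outcomes if face == 'H']
print(Fraction(coins_showing_heads.count('two-headed'), len(coins_showing_heads)))   # 2/3
```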
We therefore expect a total of 19 + 49 = 68 positive tests, of whom only 19 are truly doping. So if someone tests positive, there is only 19/68 = 28% chance they are truly doping – the remaining 72% of positive tests are false accusations. Even though drug testing could be claimed to be ‘95% accurate’, the majority of people who test positive are in fact innocent.
One way of thinking of this process is that we are ‘reversing the order’ of the tree to put testing first, followed by the revelation of the truth. This is shown explicitly in Figure 11.3. This ‘reversed tree’ arrives at exactly the same numbers for the final outcomes, but respects the temporal order in which we come to know things (testing and then the truth about doping), rather than the actual timeline of underlying causation (doping and then testing). This ‘reversal’ is exactly what Bayes’ theorem does – in fact Bayesian thinking was known as ‘inverse probability’ until the 1950s.
The sports doping example shows how easy it is to confuse the probability of doping, given a positive test (28%), with the probability of testing positive, given doping (95%).
The doping example lays out the logical steps necessary to get to the quantity that is really of interest when making decisions: out of people who test positive, the proportion who are really doping, which turns out to be 19/68. The expected frequency tree shows that this depends on three crucial numbers: the proportion of athletes who are doping (1/50, or 20/1,000 in the tree), the proportion of doping athletes who correctly test positive (95%, or 19/20 in the tree) and the proportion of non-doping athletes who incorrectly test positive (5%, or 49/980 in the tree).
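Those three numbers are all that is needed to reproduce the tree; a minimal sketch:

```python
# Expected frequencies among 1,000 athletes: 1 in 50 dope; the test detects
# 95% of dopers and falsely accuses 5% of clean athletes.
athletes = 1000
doping = athletes // 50                              # 20
clean = athletes - doping                            # 980

true_positives = 0.95 * doping                       # 19
false_positives = 0.05 * clean                       # 49
all_positives = true_positives + false_positives     # 68

print(true_positives / all_positives)                # 0.279...: about 28% of positives are truly doping
```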
Next we need to introduce the idea of a likelihood ratio, a concept that has become critical in communicating the strength of forensic evidence in criminal court cases.
Judges and lawyers are being increasingly trained to understand likelihood ratios, which essentially compare the relative support provided by a piece of evidence for two competing hypotheses, which we shall call A and B, but which would often represent guilt or innocence.
So let’s put all this together in Bayes’ theorem, which simply says that

the initial odds for a hypothesis × the likelihood ratio = the final odds for the hypothesis
In more technical language, the initial odds are known as the ‘prior’ odds, and the final odds are the ‘posterior’ odds. This formula can be repeatedly applied, with the posterior odds becoming the prior odds when introducing new, independent items of evidence.
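A minimal sketch of the odds form of Bayes' theorem, with hypothetical prior odds and likelihood ratios chosen purely for illustration:

```python
from fractions import Fraction

# Posterior odds = prior odds x likelihood ratio; with several independent
# items of evidence, each posterior becomes the prior for the next item.
def update_odds(prior_odds, likelihood_ratios):
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds

prior_odds = Fraction(1, 100)             # hypothetical: prior odds of 1 to 100 in favour
odds = update_odds(prior_odds, [10, 5])   # two hypothetical items of evidence
print(odds, odds / (1 + odds))            # 1/2 and 1/3: posterior odds of 1 to 2, probability one third
```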
Bayes’ theorem looks deceptively basic, but turns out to encapsulate an immensely powerful way of learning from data.
Then the probability of all the above events being true is 1/2 × 1/2 × 1/100 = 1/400. This is a fairly low chance that this skeleton is Richard III; the researchers who originally carried out this analysis assumed a ‘sceptical’ prior probability of 1/40, and so we are being considerably more sceptical.
Individual likelihood ratios are allowed in British courts, but they cannot be multiplied up, as in the case of Richard III, since the process of combining separate pieces of evidence is supposed to be left to the jury.3 The legal system is apparently not yet ready to embrace scientific logic.
Would the Archbishop of Canterbury cheat at poker? It is a lesser known fact about the renowned economist John Maynard Keynes that he studied probability, and came up with a thought experiment to illustrate the importance of taking into account the initial odds when assessing the implications of evidence.
Expected frequencies make Bayesian analysis reasonably straightforward for simple situations that involve only two hypotheses, say about whether someone does or does not have a disease, or has or has not committed an offence. However, things get trickier when we want to apply the same ideas to drawing inferences about unknown quantities that might take on a range of values, such as parameters in statistical models.