Kindle Notes & Highlights
Read between April 27 - May 26, 2020
The first rule of communication is to shut up and listen, so that you can get to know about the audience for your communication, whether it might be politicians, professionals or the general public. We have to understand their inevitable limitations and any misunderstandings, and fight the temptation to be too sophisticated and clever, or put in too much detail. The second rule of communication is to know what you want to achieve. Hopefully the aim is to encourage open debate, and informed decision-making. But there seems no harm in repeating yet again that numbers do not speak for themselves;
...more
Many people have a vague idea of deduction, thanks to Sherlock Holmes using deductive reasoning when he coolly announces that a suspect must have committed a crime. In real life deduction is the process of using the rules of cold logic to work from general premises to particular conclusions. If the law of the country is that cars should drive on the right, then we can deduce that on any particular occasion it is best to drive on the right. But induction works the other way, in taking particular instances and trying to work out general conclusions. For example, suppose we don’t know the customs
...more
There are three types of populations from which a sample might be drawn, whether the data come from people, transactions, trees, or anything else.
A literal population. This is an identifiable group, such as when we pick a person at random when polling. Or there may be a group of individuals who could be measured, and although we don’t actually pick one at random, we have data from volunteers. For example, we might consider the people who guessed at the number of jelly beans as a sample from the population of all maths nerds who watch YouTube videos.
A virtual population. We frequently take
...more
Epidemiology is the study of how and why diseases occur in the population, and Scandinavian countries are an epidemiologist’s dream. This is because everyone in those countries has a personal identity number which is used when registering for health care, education, tax, and so on, and this allows researchers to link all these different aspects of people’s lives together in a way that would be impossible (and perhaps politically controversial) in other countries. A typically ambitious study was conducted on over 4 million Swedish men and women whose tax and health records were linked over
...more
Two final key points:
Don’t rely on a single study: A single statin trial may tell us that the drug worked in a particular group in a particular place, but robust conclusions require multiple studies.
Review the evidence systematically: When looking at multiple trials, make sure to include every study that has been done, and so create what is known as a systematic review. The results may then be formally combined in a meta-analysis.
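A minimal Python sketch of the kind of fixed-effect (inverse-variance) pooling a meta-analysis might use; the three trial results and their standard errors are made up for illustration, not taken from the book.

```python
import numpy as np

# Hypothetical treatment-effect estimates and standard errors from three trials
estimates = np.array([-0.25, -0.31, -0.18])   # made-up numbers
std_errors = np.array([0.10, 0.08, 0.12])

# Fixed-effect pooling: weight each trial by 1 / SE^2
weights = 1 / std_errors**2
pooled = np.sum(weights * estimates) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

print(f"pooled effect = {pooled:.3f} +/- {1.96 * pooled_se:.3f} (95% interval)")
```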
Statistical models have two main components. First, a mathematical formula that expresses a deterministic, predictable component, for example the fitted straight line that enables us to make a prediction of a son’s height from his father’s. But the deterministic part of a model is not going to be a perfect representation of the observed world. As we saw in Figure 5.1, there is a big scatter of heights around the regression line, and the difference between what the model predicts, and what actually happens, is the second component of a model and is known as the residual error – although it is
...more
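A small sketch of the two components in Python, using simulated father and son heights (the slope, intercept and scatter are invented, not the values behind Figure 5.1): the fitted straight line is the deterministic part, and the residuals are what is left over.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated fathers' and sons' heights in cm (all parameters made up)
fathers = rng.normal(175, 7, size=500)
sons = 86 + 0.5 * fathers + rng.normal(0, 5, size=500)

# Deterministic component: the fitted straight line
slope, intercept = np.polyfit(fathers, sons, deg=1)
predicted = intercept + slope * fathers

# Stochastic component: residual error = observation - prediction
residuals = sons - predicted
print(f"fitted line: son = {intercept:.1f} + {slope:.2f} * father")
print(f"residual standard deviation = {residuals.std():.1f} cm")
```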
We have a strong psychological tendency to attribute change to intervention, and this makes before-and-after comparisons treacherous. A classic example concerns speed cameras, which tend to get put in places that have recently experienced accidents. When the accident rate subsequently goes down, this change is then attributed to the presence of the cameras. But would the accident rates have gone down anyway? Strings of good (or bad) luck do not go on for ever, and eventually things settle back down – this can also be considered as regression-to-the-mean, just like tall fathers tending to have
...more
Regression-to-the-mean also operates in clinical trials. In the last chapter we saw that randomized trials were needed to evaluate new pharmaceuticals properly, since even people in the control arm showed benefit – the so-called placebo effect. This is often interpreted to mean that just taking a sugar pill (preferably a red one) actually has a beneficial effect on people’s health. But much of the improvement seen in people who do not receive any active treatment may be regression-to-the-mean, since patients are enrolled in trials when they are showing symptoms, and many of these would have
...more
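A simulation sketch of regression-to-the-mean in a trial-like setting, assuming (purely for illustration) that each person has a stable underlying symptom level plus day-to-day fluctuation and that people are enrolled only when their score looks bad; the follow-up average improves even though nothing is done.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stable underlying level plus random fluctuation at two time points
true_level = rng.normal(50, 10, size=100_000)
at_enrolment = true_level + rng.normal(0, 10, size=100_000)
at_followup = true_level + rng.normal(0, 10, size=100_000)   # no treatment at all

# Enrol only those who look bad (high score) at enrolment
enrolled = at_enrolment > 70
print(f"mean score at enrolment: {at_enrolment[enrolled].mean():.1f}")
print(f"mean score at follow-up: {at_followup[enrolled].mean():.1f}  (lower, with no treatment)")
```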
A good analogy is that a model is like a map, rather than the territory itself. And we all know that some maps are better than others: a simple one might be good enough to drive between cities, but we need something more detailed when walking through the countryside. The British statistician George Box has become famous for his brief but invaluable aphorism: ‘All models are wrong, some are useful.’ This pithy statement was based on a lifetime spent bringing statistical expertise to industrial processes, which led Box to appreciate both the power of models, but also the danger of actually
...more
Senior managers simply did not realize the frail basis on which these models were built, losing track of the fact that models are simplifications of the real world – they are the maps not the territory. The result was one of the worst global economic crises in history.
this is ‘technology’ rather than science. There are two broad tasks for such an algorithm:
Classification (also known as discrimination or supervised learning): to say what kind of situation we’re facing. For example, the likes and dislikes of an online customer, or whether that object in a robot’s vision is a child or a dog.
Prediction: to tell us what is going to happen. For example, what the weather will be next week, what a stock price might do tomorrow, what products that customer might buy, or whether that child is going to run out in front of our self-driving car.
Although these tasks
...more
‘Narrow’ AI refers to systems that can carry out closely prescribed tasks, and there have been some extraordinarily successful examples based on machine learning, which involves developing algorithms through statistical analysis of large sets of historical examples. Notable successes include speech recognition systems built into phones, tablets and computers; programs such as Google Translate which know little grammar but have learned to translate text from an immense published archive; and computer vision software that uses past images to ‘learn’ to identify, say, faces in photographs or
...more
The other way that data can be ‘big’ is by measuring many characteristics, or features, on each example. This quantity is often known as p, perhaps denoting parameters. Thinking again back to my statistical youth, p used to be generally less than 10 – perhaps we knew a few items of an individual’s medical history. But then we started having access to millions of that person’s genes, and genomics became a small n, large p problem, where there was a huge amount of information about a relatively small number of cases. And now we have entered the era of large n, large p problems, in which there
...more
One strategy for dealing with an excessive number of cases is to identify groups that are similar, a process known as clustering or unsupervised learning, since we have to learn about these groups and are not told in advance that they exist. Finding these fairly homogeneous clusters can be an end in itself, for example by identifying groups of people with similar likes and dislikes, which then can be characterized, given a label, and algorithms built for classifying future cases. The clusters that have been identified can then be fed appropriate film recommendations, advertisements, or
...more
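A toy sketch of clustering (unsupervised learning) as a tiny hand-written k-means in Python, run on made-up two-feature data; a real application would use a library routine, and the number of clusters here is simply assumed to be three.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data: two features, three loose groups
data = np.vstack([rng.normal(loc, 1.0, size=(200, 2)) for loc in ([0, 0], [6, 1], [3, 7])])

def kmeans(x, k, n_iter=50):
    """Very small k-means: alternate assigning points to the nearest centre
    and moving each centre to the mean of its points."""
    centres = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centres = []
        for j in range(k):
            members = x[labels == j]
            new_centres.append(members.mean(axis=0) if len(members) else centres[j])
        centres = np.array(new_centres)
    return labels, centres

labels, centres = kmeans(data, k=3)
print("cluster sizes:", np.bincount(labels))
print("cluster centres:\n", centres.round(1))
```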
A bewildering range of alternative methods are now readily available for building classification and prediction algorithms. Researchers used to promote methods which came from their own professional backgrounds: for example statisticians preferred regression models, while computer scientists preferred rule-based logic or ‘neural networks’ which were alternative ways to try and mimic human cognition. Implementation of any of these methods required specialized skills and software, but now convenient programs allow a menu-driven choice of technique, and so encourage a less partisan approach where
...more
survivors that are correctly predicted is known as the sensitivity of the algorithm, while the percentage of true non-survivors that are correctly predicted is known as the specificity. These terms arise from medical diagnostic testing. Although the overall accuracy is simple to express, it is a very crude measure of performance and takes no account of the confidence with which a prediction is made. If we look at the tips of the branches of the classification tree, we can see that the discrimination of the training data is not perfect, and at all branches there are some who survive and some
...more
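A minimal illustration of sensitivity and specificity in Python, using a handful of invented actual outcomes and yes/no predictions (not the Titanic data).

```python
import numpy as np

# Hypothetical actual outcomes (1 = survived) and an algorithm's yes/no predictions
actual    = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
predicted = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])

# Sensitivity: proportion of true survivors correctly predicted to survive
sensitivity = (predicted[actual == 1] == 1).mean()
# Specificity: proportion of true non-survivors correctly predicted not to survive
specificity = (predicted[actual == 0] == 0).mean()
accuracy = (predicted == actual).mean()

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}, accuracy = {accuracy:.2f}")
```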
In practice weather forecasts are based on extremely complex computer models which encapsulate detailed mathematical formulae representing how weather develops from current conditions, and each run of the model produces a deterministic yes/no prediction of rain at a particular place and time. So to produce a probabilistic forecast, the model has to be run many times starting at slightly adjusted initial conditions, which produces a list of different ‘possible futures’, in some of which it rains and in some it doesn’t. Forecasters run an ‘ensemble’ of, say, fifty models, and if it rains in five
...more
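A toy stand-in for the ensemble idea in Python: a deliberately trivial deterministic "model" is run fifty times from slightly perturbed initial conditions, and the forecast probability of rain is simply the fraction of runs in which it rains. The model, threshold and perturbation sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

def toy_weather_model(initial_humidity):
    """Stand-in for a deterministic forecast model: a yes/no rain prediction
    given the starting conditions (the threshold is made up)."""
    return initial_humidity > 0.65

# Run an 'ensemble' of 50 forecasts from slightly perturbed initial conditions
observed_humidity = 0.63
ensemble = [toy_weather_model(observed_humidity + rng.normal(0, 0.03)) for _ in range(50)]

rainy_runs = int(sum(ensemble))
print(f"it rained in {rainy_runs} of 50 runs -> forecast probability of rain = {rainy_runs / 50:.0%}")
```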
We over-fit when we go too far in adapting to local circumstances, in a worthy but misguided effort to be ‘unbiased’ and take into account all the available information. Usually we would applaud the aim of being unbiased, but this refinement means we have less data to work on, and so the reliability goes down. Over-fitting therefore leads to less bias but at a cost of more uncertainty or variation in the estimates, which is why protection against over-fitting is sometimes known as the bias/variance trade-off. We can illustrate this subtle idea by imagining a huge database of people’s lives
...more
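A sketch of over-fitting on simulated data: as the model is made more flexible (here, a higher-degree polynomial, with all numbers invented), the error on the training data keeps falling while the error on fresh data eventually gets worse.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data: a gentle curve plus noise (all values made up)
def make_data(n):
    x = rng.uniform(-3, 3, size=n)
    y = 0.5 * x + 0.3 * x**2 + rng.normal(0, 1.0, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)

for degree in (1, 2, 8, 15):
    coeffs = np.polyfit(x_train, y_train, deg=degree)   # more flexible model as degree grows
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: training error {train_err:.2f}, test error {test_err:.2f}")
```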
It is essential to test any predictions on an independent test set that was not used in the training of the algorithm, but that only happens at the end of the development process. So although it might show up our over-fitting at that time, it does not build us a better algorithm. We can, however, mimic having an independent test set by removing say 10% of the training data, developing the algorithm on the remaining 90%, and testing on the removed 10%. This is cross-validation, and can be carried out systematically by removing 10% in turn and repeating the procedure ten times, a procedure known
...more
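A minimal tenfold cross-validation written out by hand in Python on simulated data: each 10% is held out in turn, the model is fitted to the remaining 90%, and the prediction errors on the held-out parts are averaged.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data for illustration
x = rng.uniform(-3, 3, size=200)
y = 0.5 * x + 0.3 * x**2 + rng.normal(0, 1.0, size=200)

def tenfold_cv_error(x, y, degree, k=10):
    """Tenfold cross-validation: hold out each 10% in turn, train on the rest,
    and average the prediction error on the held-out parts."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        coeffs = np.polyfit(x[train], y[train], deg=degree)
        errors.append(np.mean((np.polyval(coeffs, x[fold]) - y[fold]) ** 2))
    return np.mean(errors)

for degree in (1, 2, 5, 12):
    print(f"degree {degree:2d}: cross-validated error {tenfold_cv_error(x, y, degree):.2f}")
```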
Implicit bias: To repeat, algorithms are based on associations, which may mean they end up using features that we would normally think are irrelevant to the task in hand. When a vision algorithm was trained to discriminate pictures of huskies from German Shepherds, it was very effective until it failed on huskies that were kept as pets – it turned out that its apparent skill was based on identifying snow in the background.4 Less trivial examples include an algorithm for identifying beauty that did not like dark skin, and another that identified Black people as gorillas. Algorithms that can
...more
Traditionally a statistics course would start with probability – that is how I have always begun when teaching in Cambridge – but this rather mathematical initiation can be an obstruction to grasping all the important ideas in the preceding chapters that did not require probability theory. In contrast, this book is part of what could be called a new wave in statistics teaching, in which formal probability theory as a basis for statistical inference does not come in till much later.2 We have seen that computer simulation is a very powerful tool for both exploring possible future events and
...more
This is the idea of ‘expected frequency’. When faced with the problem of the two coins, you ask yourself, ‘What would I expect to happen if I tried the experiment a number of times?’ Let’s say that you tried flipping first one coin, and then another, a total of four times. I suspect that even a politician could, with a bit of thought, conclude that they would expect to get the results shown in Figure 8.2. So 1 in 4 times you would expect to get two heads. Therefore, the reasoning goes, the probability that on a particular attempt you would get two heads is 1 in 4, or ¼. Which, fortunately, is
...more
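A small check of the expected-frequency reasoning for the two coins in Python: list the four equally likely outcomes, and then confirm by simulation that roughly 1 in 4 attempts gives two heads.

```python
from itertools import product
import random

# All equally likely outcomes of flipping two fair coins
outcomes = list(product("HT", repeat=2))
print(outcomes)                                   # [('H','H'), ('H','T'), ('T','H'), ('T','T')]
print("P(two heads) =", outcomes.count(("H", "H")), "in", len(outcomes))

# The expected-frequency reading: out of many repetitions, about 1 in 4 give two heads
random.seed(0)
trials = 100_000
two_heads = sum(random.choice("HT") == "H" and random.choice("HT") == "H" for _ in range(trials))
print(f"simulated: {two_heads} of {trials} attempts gave two heads (~{two_heads / trials:.2f})")
```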
We are still making strong assumptions, even in this simple coin-flipping example. We are assuming the coin is fair and balanced, it is flipped properly so the result is not predictable, it does not land on its edge, an asteroid does not strike after the first flip, and so on. These are serious considerations (except possibly the asteroid): they serve to emphasize that all the probabilities we use are conditional – there is no such thing as the unconditional probability of an event; there are always assumptions and other factors that could affect the probability. And, as we now see, we need to
...more
Fortunately we don’t have to believe that events are actually driven by pure randomness (whatever that is). It is simply that an assumption of ‘chance’ encapsulates all the inevitable unpredictability in the world, or what is sometimes termed natural variability. We have therefore established that probability forms the appropriate mathematical foundation for both ‘pure’ randomness, which occurs with subatomic particles, coins, dice, and so on; and ‘natural’, unavoidable variability, such as in birth weights, survival after surgery, examination results, homicides, and every other phenomenon
...more
Apart from his work on wisdom of crowds, correlation, regression, and almost everything else, Francis Galton also considered it a true marvel that the normal distribution, then known as the Law of Frequency of Error, should arise in an orderly way out of apparent chaos: I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the ‘Law of Frequency of Error’. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The
...more
This simple exercise reveals a major distinction between two types of uncertainty: what is known as aleatory uncertainty before I flip the coin – the ‘chance’ of an unpredictable event – and epistemic uncertainty after I flip the coin – an expression of our personal ignorance about an event that is fixed but unknown. The same difference exists between a lottery ticket (where the outcome depends on chance) and a scratch card (where the outcome is already decided, but you don’t know what it is). Statistics are used when we have epistemic uncertainty about some quantity of the world. For example,
...more
The procedure for deriving an uncertainty interval around our estimate, or equivalently a margin of error, is based on this fundamental idea. There are three stages:
We use probability theory to tell us, for any particular population parameter, an interval in which we expect the observed statistic to lie with 95% probability. These are 95% prediction intervals, such as those displayed in the inner funnel in Figure 9.2.
Then we observe a particular statistic.
Finally (and this is the difficult bit) we work out the range of possible population parameters for which our statistic lies in their 95%
...more
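A sketch of the three stages in Python for a survey proportion, assuming simple random sampling and a normal approximation: for each candidate population value, work out the 95% prediction interval for the sample proportion, then keep the candidates whose interval contains the observed statistic. The survey size and observed proportion are illustrative.

```python
import numpy as np

n = 1000                    # sample size (illustrative)
observed = 0.40             # observed proportion preferring coffee, say

kept = []
for p in np.arange(0.005, 1.0, 0.005):            # candidate population proportions
    se = np.sqrt(p * (1 - p) / n)
    lower, upper = p - 1.96 * se, p + 1.96 * se   # 95% prediction interval for the sample proportion
    if lower <= observed <= upper:                # does our observed statistic lie inside it?
        kept.append(p)

print(f"95% confidence interval for the population proportion: {min(kept):.3f} to {max(kept):.3f}")
```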
A simple rule of thumb is that, if you are estimating the percentage of people who prefer, say, coffee to tea for breakfast, and you ask a random sample from a population, then your margin of error (in %) is at most plus or minus 100 divided by the square root of the sample size.2 So for a survey of 1,000 people (the industry standard), the margin of error is generally quoted as ± 3%:fn8 if 400 of them said they preferred coffee, and 600 of them said they preferred tea, then you could roughly estimate the underlying percentage of people in the population who prefer coffee as 40 ± 3%, or
...more
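The rule of thumb worked through in Python for the survey figures quoted above (1,000 people, 400 preferring coffee), alongside the usual normal-approximation margin for comparison.

```python
import math

n = 1000                      # survey size
p_hat = 400 / n               # observed proportion preferring coffee

rule_of_thumb = 100 / math.sqrt(n)                        # 'at most +/- 100/sqrt(n)' percentage points
normal_margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n) * 100

print(f"rule of thumb: +/-{rule_of_thumb:.1f}%")           # about 3.2%
print(f"normal approximation at p = 0.4: +/-{normal_margin:.1f}%")   # about 3.0%
```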
We have already seen many of the reasons why surveys can be inaccurate, on top of the inevitable (and quantifiable) margin of error due to random variability. In this case the excess variability might be blamed on the sampling methods, in particular the use of telephone polls with a very low response rate, perhaps between 10% and 20%, and mainly using landlines. My personal, rather sceptical heuristic is that any quoted margin of error in a poll should be doubled to allow for systematic errors made in the polling.
But there is a long history of claimed margins of error from such experiments later being found to be hopelessly inadequate: in the first part of the twentieth century, the uncertainty intervals around the estimates of the speed of light did not include the current accepted value. This has led organizations that work on metrology, the science of measurement, to specify that margins of error should always be based on two components:
Type A: the standard statistical measures discussed in this chapter, which would be expected to reduce with more observations.
Type B: systematic errors that would
...more
The standard deviation of this Poisson distribution is the square root of m, written √m, which is also the standard error of our estimate. This would allow us to create a confidence interval, if only we knew m. But we don’t (that’s the whole point of the exercise). Consider the 2014–2015 period, when there were 497 homicides, which is our estimate for the underlying rate m that year. We can use this estimate for m to estimate the standard error √m as √497 = 22.3. This gives a margin of error of ± 1.96 × 22.3 = ± 43.7. So we can finally get to our approximate 95% interval for m as 497 ± 43.7 =
...more
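The Poisson interval arithmetic from this passage, reproduced in Python using the 497 homicides quoted for 2014-15.

```python
import math

homicides = 497                       # observed count for 2014-15, as in the text
estimate = homicides                  # estimate of the underlying rate m
se = math.sqrt(estimate)              # standard error of a Poisson count is sqrt(m)

margin = 1.96 * se
print(f"standard error = {se:.1f}")                                     # about 22.3
print(f"approximate 95% interval for m: {estimate - margin:.1f} to {estimate + margin:.1f}")
```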
He found there had been more males than females baptized in every year, with an overall sex ratio of 107, varying between 101 and 116 over the period. But Arbuthnot wanted to claim a more general law, and so argued that if there were really no difference in the underlying rates of boys and girls being born, then each year there would be a 50:50 chance that more boys than girls were born, or more girls than boys, just like flipping a coin. But to get an excess of boys in every year would then be like flipping a fair coin 82 times in a row, and getting heads every time. The probability of this
...more
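Arbuthnot's probability in Python: 82 consecutive years of more boys than girls, treated as 82 flips of a fair coin all landing heads.

```python
# Probability of an excess of boys in all 82 years, if each year were a fair coin flip
p = 0.5 ** 82
print(p)                                # about 2.1e-25
print(f"roughly 1 in {1 / p:.0e}")
```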
What Is a ‘Hypothesis’? A hypothesis can be defined as a proposed explanation for a phenomenon. It is not the absolute truth, but a provisional, working assumption, perhaps best thought of as a potential suspect in a criminal case. When discussing regression in Chapter 5, we saw the claim that observation = deterministic model + residual error. This represents the idea that statistical models are mathematical representations of what we observe, which combine a deterministic component with a ‘stochastic’ component, the latter representing unpredictability or random ‘error’, generally expressed
...more
The idea of a null hypothesis now becomes central: it is the simplified form of statistical model that we are going to work with until we have sufficient evidence against it. In the questions listed above, the null hypotheses might be:
The daily number of homicides in the UK do follow a Poisson distribution.
The UK unemployment rate has remained unchanged over the last quarter.
Statins do not reduce the risk of heart attacks and strokes in people like me.
Mothers’ heights have no effect on sons’ heights, once fathers’ heights are taken into account.
The Higgs boson does not exist.
The null
...more
This tail-area is known as a P-value, one of the most prominent concepts in statistics as practised today, and which therefore deserves a formal definition in the text: A P-value is the probability of getting a result at least as extreme as we did, if the null hypothesis (and all other modelling assumptions) were really true. The issue, of course, is what do we mean by ‘extreme’? Our current P-value of 0.45 is one-tailed, since it only measures how likely it is that we would have observed such an extreme value in favour of females, were the null hypothesis really true. This P-value corresponds
...more
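A small illustration of one-tailed versus two-tailed P-values in Python, using an invented example (8 heads in 10 flips of a supposedly fair coin) rather than the book's own data: the one-tailed value counts results at least as extreme in one direction, the two-tailed value counts both directions.

```python
from math import comb

n, observed_heads = 10, 8        # illustrative data, not the book's example

def binom_prob(k, n, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# One-tailed P-value: probability of a result at least this extreme in ONE direction
one_tailed = sum(binom_prob(k, n) for k in range(observed_heads, n + 1))

# Two-tailed P-value (for a symmetric null): count extremes in both directions
two_tailed = one_tailed + sum(binom_prob(k, n) for k in range(0, n - observed_heads + 1))

print(f"one-tailed P = {one_tailed:.3f}, two-tailed P = {two_tailed:.3f}")
```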
This term was popularized by Ronald Fisher in the 1920s and, in spite of the criticisms we shall see later, continues to play a major role in statistics. Ronald Fisher was an extraordinary, but difficult, man. He was extraordinary because he is regarded as a pioneering figure in two distinct fields – genetics and statistics. Yet he had a notorious temper and could be extremely critical of anyone who he felt questioned his ideas, while his support for eugenics and his public criticism of the evidence for the link between smoking and lung cancer damaged his standing. His personal reputation has
...more
This highlight has been truncated due to consecutive passage length restrictions.
To summarize, I have described the following steps:
Set up a question in terms of a null hypothesis that we want to check. This is generally given the notation H0.
Choose a test statistic that estimates something that, if it turned out to be extreme enough, would lead us to doubt the null hypothesis (often larger values of the statistic indicate incompatibility with the null hypothesis).
Generate the sampling distribution of this test statistic, were the null hypothesis true.
Check whether our observed statistic lies in the tails of this distribution and summarize this by the P-value: the
...more
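The steps run end-to-end in Python on invented measurements for two groups, using a permutation approach to generate the sampling distribution: the null hypothesis is 'no difference between groups', the test statistic is the difference in means, and the P-value is the tail area.

```python
import numpy as np

rng = np.random.default_rng(6)

# Made-up measurements for two groups (e.g. treated vs control)
group_a = np.array([5.1, 6.0, 5.6, 6.3, 5.8, 6.1])
group_b = np.array([5.0, 5.4, 5.2, 5.7, 5.3, 5.5])

# Test statistic: difference in means (large values cast doubt on the null hypothesis)
observed = group_a.mean() - group_b.mean()

# Sampling distribution under the null (no difference), generated by shuffling the group labels
pooled = np.concatenate([group_a, group_b])
null_stats = []
for _ in range(20_000):
    shuffled = rng.permutation(pooled)
    null_stats.append(shuffled[:6].mean() - shuffled[6:].mean())
null_stats = np.array(null_stats)

# P-value: how often the shuffled statistic is at least as extreme as the observed one (one-tailed)
p_value = np.mean(null_stats >= observed)
print(f"observed difference = {observed:.2f}, one-tailed P-value ~ {p_value:.3f}")
```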
Often we make use of approximations that were developed by the pioneers of statistical inference. For example, around 1900 Karl Pearson developed a series of statistics for testing associations in cross-tabulations such as Table 10.1, out of which grew the classic chi-squared test of association.fn3 These test statistics involve calculating the expected number of events in each cell of the table, were the null hypothesis of no-association true, and then a chi-squared statistic measures the total discrepancy between the observed and expected counts. Table 10.2 shows the expected numbers in the
...more
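The chi-squared calculation sketched in Python on a made-up 2x2 cross-tabulation (not the counts in Table 10.1): expected counts come from row total times column total divided by the grand total, and the statistic totals the discrepancies.

```python
import numpy as np

# Made-up 2x2 cross-tabulation of counts (rows: group, columns: outcome)
observed = np.array([[30, 70],
                     [50, 50]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected counts under the null hypothesis of no association
expected = row_totals * col_totals / grand_total

# Pearson chi-squared statistic: total discrepancy between observed and expected counts
chi_squared = ((observed - expected) ** 2 / expected).sum()
print("expected counts:\n", expected)
print(f"chi-squared statistic = {chi_squared:.2f} on 1 degree of freedom")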
In Chapter 7 we saw that a quarterly change in unemployment of −3,000 had a margin of error of ± 77,000, based on ± 2 standard errors. This means the 95% confidence interval runs from −80,000 to +74,000 and clearly contains the value 0, corresponding to no change in unemployment. But the fact that this 95% interval includes 0 is logically equivalent to the point estimate (−3,000) being less than 2 standard errors from 0, meaning the change is not significantly different from 0. This reveals the essential identity between hypothesis testing and confidence intervals: A two-sided P-value is less than
...more
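The arithmetic behind that equivalence in Python, using the unemployment figures quoted above: the interval contains 0 exactly when the estimate is within about 2 standard errors of 0.

```python
estimate = -3_000          # estimated quarterly change in unemployment
margin = 77_000            # margin of error, roughly 2 standard errors
standard_error = margin / 2

lower, upper = estimate - margin, estimate + margin
z = estimate / standard_error

print(f"95% interval: {lower:,} to {upper:,}  (contains 0: {lower <= 0 <= upper})")
print(f"z = estimate / standard error = {z:.2f}  (|z| < 2, so not significant at the 5% level)")
```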
t-statistic, is a major focus of attention, since it is the link that tells us whether the association between an explanatory variable and the response is statistically significant. The t-value is a special case of what is known as a Student’s t-statistic. ‘Student’ was the pseudonym of William Gosset, who developed the method in 1908 while on secondment at University College London from the Guinness brewery in Dublin – they wanted to preserve their employee’s anonymity. The t-value is simply the estimate/standard error (this can be checked for the numbers in Table 10.5), and so can be
...more
In Chapter 6 we saw that an algorithm might win a prediction competition by a very small margin. When predicting the survival of the Titanic test set, for example, the simple classification tree achieved the best Brier score (average mean squared prediction error) of 0.139, only slightly lower than the score of 0.142 from the averaged neural network (see Table 6.4). It is reasonable to ask whether this small winning margin of −0.003 is statistically significant, in the sense of whether or not it could be explained by chance variation. This is straightforward to check, and the t-statistic turns
...more
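A minimal Brier-score calculation in Python on invented probabilistic predictions (not the Titanic test set): the score is the mean squared difference between the predicted probabilities and the 0/1 outcomes, so lower is better.

```python
import numpy as np

# Made-up survival probabilities from two algorithms, and actual outcomes (1 = survived)
outcomes = np.array([1, 0, 0, 1, 1, 0, 1, 0])
algo_a   = np.array([0.9, 0.2, 0.3, 0.6, 0.8, 0.1, 0.7, 0.4])
algo_b   = np.array([0.8, 0.3, 0.2, 0.5, 0.9, 0.2, 0.6, 0.5])

def brier(pred, actual):
    """Brier score: mean squared prediction error (lower is better)."""
    return np.mean((pred - actual) ** 2)

print(f"algorithm A: {brier(algo_a, outcomes):.3f}")
print(f"algorithm B: {brier(algo_b, outcomes):.3f}")
```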
Another way to avoid false-positives is to demand replication of the original study, with the repeat experiment carried out in entirely different circumstances, but with essentially the same protocol. For new pharmaceuticals to be approved by the US Food and Drug Administration, it has become standard that two independent clinical trials must have been carried out, each showing clinical benefit that is significant at P < 0.05. This means that the overall chance of approving a drug that in truth has no benefit at all is 0.05 × 0.05 = 0.0025, or 1 in 400.
Type I error is made when we reject a null hypothesis when it is true, and a Type II error is made when we do not reject a null hypothesis when in fact the alternative hypothesis holds. There is a strong legal analogy which is illustrated in Table 10.6 – a Type I legal error is to falsely convict an innocent person, and a Type II error is to find someone ‘not guilty’ when in fact they did commit the crime. When planning an experiment, Neyman and Pearson suggested that we should choose two quantities which together will determine how large the experiment should be. First, we should fix the
...more
Formulae exist for the size and power of different forms of experiment, and they each depend crucially on sample size. But if the sample size is fixed, there is an inevitable trade-off: to increase power, we can always make the threshold for ‘significance’ less stringent and so make it more likely we will correctly identify a true effect, but this means increasing the chance of a Type I error (the size). In the legal analogy, we can loosen the criteria for conviction, say by loosening the requirement of proof ‘beyond reasonable doubt’, and this will result in more criminals being correctly
...more
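A simulation sketch of the size/power trade-off in Python (rather than the formulae the text refers to), assuming a simple two-group comparison with a made-up true effect and sample size: a looser significance threshold raises the power but also the Type I error rate.

```python
import numpy as np

rng = np.random.default_rng(7)
n, true_effect, simulations = 50, 0.4, 5_000

def proportion_significant(effect, threshold):
    """Fraction of simulated trials whose z-statistic crosses the two-sided threshold."""
    hits = 0
    for _ in range(simulations):
        control = rng.normal(0, 1, n)
        treated = rng.normal(effect, 1, n)
        z = (treated.mean() - control.mean()) / np.sqrt(2 / n)
        if abs(z) > threshold:
            hits += 1
    return hits / simulations

for threshold, label in [(1.96, "P < 0.05"), (2.81, "P < 0.005")]:
    size = proportion_significant(0.0, threshold)           # Type I error rate (the 'size')
    power = proportion_significant(true_effect, threshold)  # chance of detecting a real effect
    print(f"{label}: size ~ {size:.3f}, power ~ {power:.2f}")
```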
As we shall see in the next chapter, Neyman and Pearson had vehement, even publicly abusive, arguments with Fisher over the appropriate form of hypothesis testing, and this conflict has never been resolved into a single ‘correct’ approach. The Heart Protection Study shows that clinical trials tend to be designed from a Neyman–Pearson perspective but, strictly speaking, size and power are irrelevant once the experiment has actually been carried out. At this point the trial is analysed using confidence intervals to show the plausible values for the treatment effects, and Fisherian P-values to
...more
Statisticians in the US and UK, working independently, developed what became known as the Sequential Probability Ratio Test (SPRT), which is a statistic that monitors accumulating evidence about deviations, and can at any time be compared with simple thresholds – as soon as one of these thresholds is crossed, then an alert is triggered and the production line is investigated.fn8 Such techniques led to more efficient industrial processes, and were later adapted for use in so-called sequential clinical trials in which accumulated results are repeatedly monitored to see if a threshold that
...more
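A minimal sketch of the sequential-testing idea in Python, in the Wald SPRT style: accumulate the log-likelihood ratio between two hypotheses as observations arrive, and stop as soon as it crosses an upper or lower threshold. The hypotheses, error rates and data here are all invented, not the industrial or clinical settings described in the text.

```python
import math
import random

random.seed(8)

# Sequentially compare H0: mean = 0 against H1: mean = 1 for normal measurements
# with standard deviation 1 (all values made up for illustration)
mu0, mu1, sigma = 0.0, 1.0, 1.0
alpha, beta = 0.05, 0.10                       # tolerated error rates
upper = math.log((1 - beta) / alpha)           # cross this -> trigger an alert
lower = math.log(beta / (1 - alpha))           # cross this -> no evidence of deviation

log_lr = 0.0
for i in range(1, 1000):
    x = random.gauss(0.7, sigma)               # the true mean is 0.7 in this simulation
    # log-likelihood ratio contribution of one normal observation
    log_lr += (x - mu0) ** 2 / (2 * sigma**2) - (x - mu1) ** 2 / (2 * sigma**2)
    if log_lr >= upper:
        print(f"alert triggered after {i} observations (log LR = {log_lr:.2f})")
        break
    if log_lr <= lower:
        print(f"no deviation detected after {i} observations")
        break
```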
A monitoring system for general practitioners was subsequently piloted, which immediately identified a GP with even higher mortality rates than Shipman! Investigation revealed this doctor practised in a south-coast town with a large number of retirement homes with many old people, and he conscientiously helped many of his patients to remain out of hospital for their death. It would have been completely inappropriate for this GP to receive any publicity for his apparently high rate of signing death certificates. The lesson here is that while statistical systems can detect outlying outcomes,
...more
This reveals that we expect to claim 125 ‘discoveries’, but of these 45 are false-positives: in other words 36%, or over a third, of the rejected null hypotheses (the ‘discoveries’) are incorrect claims. This rather gloomy picture becomes even worse when we consider what actually ends up in the scientific literature, since journals are biased towards publishing positive results. A similar analysis of scientific studies led to Stanford professor of medicine and statistics John Ioannidis’s famous claim in 2005 that ‘most published research findings are false’.9 We shall return to the reasons for
...more
This claim, partly based on the ‘Bayesian’ reasoning outlined in the next chapter, has led a prominent group of statisticians to argue that the standard threshold for a ‘discovery’ of a new effect should be changed to P < 0.005.11 What effect might this have? Changing the criterion for ‘significance’ from 0.05 (1 in 20) to 0.005 (1 in 200) in Figure 10.5 would mean that instead of having 45 false-positive ‘discoveries’, we would have only 4.5. This would reduce the total number of discoveries to 84.5, and of these only 4.5 (5%) would be false discoveries. Which would be a considerable
...more
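The false-discovery arithmetic behind these two passages, reproduced in Python under assumptions chosen to match the quoted counts: 1,000 hypotheses, of which 10% are real effects, tested with 80% power.

```python
hypotheses = 1_000        # numbers chosen to reproduce the counts quoted in the text
proportion_real = 0.10    # 100 real effects, 900 true nulls
power = 0.80              # chance a real effect is detected

real, null = hypotheses * proportion_real, hypotheses * (1 - proportion_real)

for alpha in (0.05, 0.005):
    true_pos = power * real
    false_pos = alpha * null
    discoveries = true_pos + false_pos
    print(f"alpha = {alpha}: {discoveries:.1f} discoveries, "
          f"{false_pos:.1f} false ({false_pos / discoveries:.0%})")
```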
Which brings us to Bayes’ second key contribution: a result in probability theory that allows us to continuously revise our current probabilities in the light of new evidence. This has become known as Bayes’ theorem, and essentially provides a formal mechanism for learning from experience, which is an extraordinary achievement for an obscure clergyman from a small English spa town. Bayes’ legacy is the fundamental insight that the data does not speak for itself – our external knowledge, and even our judgement, has a central role. This may seem to be incompatible with the scientific process,
...more
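Bayes' theorem in its odds form, sketched in Python on a made-up screening example (the prevalence, sensitivity and specificity are invented): posterior odds = likelihood ratio times prior odds, which is exactly the 'revising probabilities in the light of new evidence' described above.

```python
# Bayes' theorem in odds form: posterior odds = likelihood ratio x prior odds
# Made-up screening example: 1% prevalence, test sensitivity 90%, specificity 95%
prior = 0.01
sensitivity, specificity = 0.90, 0.95

prior_odds = prior / (1 - prior)
likelihood_ratio = sensitivity / (1 - specificity)   # P(positive | disease) / P(positive | no disease)
posterior_odds = likelihood_ratio * prior_odds
posterior = posterior_odds / (1 + posterior_odds)

print(f"probability of disease after a positive test ~ {posterior:.2f}")   # about 0.15
```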