Kindle Notes & Highlights
The mean of a random variable is also known as its expectation, and in all these samples we expect a proportion of 0.2 or 20%.
Note that the standard deviation of a statistic is generally termed the standard error, to distinguish it from the standard deviation of the population distribution from which it derives.
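A quick simulation sketch of this distinction (Python/NumPy; the 0.2 proportion and the sample size are illustrative choices, not the book's data): the standard deviation of the sample proportion – its standard error – comes out close to the theoretical √(p(1 − p)/n), which shrinks as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(42)
p, n, reps = 0.2, 100, 10_000            # illustrative proportion, sample size, repetitions

# Draw many samples of size n and record each sample's observed proportion
proportions = rng.binomial(n, p, size=reps) / n

print("mean of sample proportions:", proportions.mean())        # close to 0.2, the expectation
print("standard error (simulated):", proportions.std())         # spread of the statistic
print("standard error (theory):   ", np.sqrt(p * (1 - p) / n))  # sqrt(p(1-p)/n), about 0.04
```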
This type of graph is called a funnel plot and is extensively used when examining multiple health authorities or institutions, since it permits the identification of outliers without creating spurious league tables.
The data fall within the control limits rather well, which means that differences between districts are essentially what we would expect by chance variability alone. Smaller districts have fewer cases and so are more vulnerable to the role of chance, and therefore tend to have more extreme results – the rate in Rossendale was based on only 7 deaths, and so its rate could be drastically altered by just a few extra cases.
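One standard way of drawing such control limits for proportions is sketched below (an assumed construction with a made-up overall rate, not necessarily the exact method behind the book's figure): under the null hypothesis that every district shares the overall rate, limits at roughly two and three standard errors form the funnel.

```python
import numpy as np
from scipy.stats import norm

p0 = 0.001                              # hypothetical overall rate across all districts
n = np.linspace(5_000, 500_000, 200)    # range of district population sizes

# Standard error of an observed proportion if the true rate everywhere were p0
se = np.sqrt(p0 * (1 - p0) / n)

limits_95  = p0 + np.outer([-1, 1], norm.ppf(0.975) * se)   # inner funnel, ~2 standard errors
limits_998 = p0 + np.outer([-1, 1], norm.ppf(0.999) * se)   # outer funnel, ~3 standard errors

# Plotting each district's observed rate against its population size then shows
# whether any district falls outside the funnel formed by these limits.
```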
There is a crucial lesson in this simple example. Even in an era of open data, data science and data journalism, we still need basic statistical principles in order not to be misled by apparent patterns in the numbers.
Individual data-points might be drawn from a wide variety of population distributions, some of which might be highly skewed, with long tails such as those of income or sexual partners. But we have now made the crucial shift to considering distributions of statistics rather than individual data-points, and these statistics will commonly be averages of some sort.
But the coin has no memory – the key insight is that the coin cannot compensate for past imbalances, but simply overwhelms them by more and more new, independent flips.
In Chapter 3 we introduced the classic ‘bell-shaped curve’, also known as the normal or Gaussian distribution, where we showed it described well the distribution of birth weights in the US population, and argued that this was because birth weight depends on a huge number of factors, all of which have a little influence – when we add up all those small effects we get a normal distribution.
It is a remarkable fact that, virtually whatever the shape of the population distribution from which each of the original measurements is sampled, for large sample sizes their average can be considered to be drawn from a normal curve.
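A small simulation makes the point (an illustrative set-up, not an example from the book): even when individual values come from a heavily skewed exponential distribution, the averages of repeated samples are nearly symmetric and bell-shaped.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_size, reps = 200, 10_000

# A heavily skewed 'population': exponential values with mean 1 (skewness 2)
averages = rng.exponential(scale=1.0, size=(reps, sample_size)).mean(axis=1)

print("mean of the averages:", averages.mean())     # close to the population mean of 1.0
skew = ((averages - averages.mean()) ** 3).mean() / averages.std() ** 3
print("skewness of the averages:", skew)            # close to 0, far below the raw skewness of 2
```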
We have to find a way of reversing the process: instead of going from known populations to saying something about possible samples, we need to go from a single sample back to saying something about a possible population. This is the process of inductive inference outlined in Chapter 3.
This simple exercise reveals a major distinction between two types of uncertainty: what is known as aleatory uncertainty before I flip the coin – the ‘chance’ of an unpredictable event – and epistemic uncertainty after I flip the coin – an expression of our personal ignorance about an event that is fixed but unknown. The same difference exists between a lottery ticket (where the outcome depends on chance) and a scratch card (where the outcome is already decided, but you don’t know what it is).
So probability theory, which tells us what to expect in the future, is used to tell us what we can learn from what we have observed in the past. This is the (rather remarkable) basis for statistical inference.
We saw in Chapter 7 how bootstrapping could be used to get 95% intervals for the gradient of Galton’s regression of daughters’ on mothers’ heights. It is far easier to obtain exact intervals that are based on probability theory and provided in standard software, and Table 9.1 shows they give very similar results. The ‘exact’ intervals based on probability theory require more assumptions than the bootstrap approach, and strictly speaking would only be precisely correct if the underlying population distribution were normal. But the Central Limit Theorem means that, with such a large sample size, this normality assumption matters little, which is why the two sets of intervals agree so closely.
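A rough sketch of the bootstrap side of that comparison (the mother/daughter heights below are simulated stand-ins for Galton's data, and the coefficients are invented): resample the pairs with replacement, refit the gradient each time, and read off the 2.5% and 97.5% points.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in for Galton's data: daughters' heights regressed on mothers'
n = 1000
mothers = rng.normal(64, 2.4, n)
daughters = 64 + 0.3 * (mothers - 64) + rng.normal(0, 2.2, n)

def gradient(x, y):
    """Least-squares slope of y on x."""
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

# Bootstrap: resample (mother, daughter) pairs with replacement and refit each time
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    boot.append(gradient(mothers[idx], daughters[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap interval for the gradient: {lo:.3f} to {hi:.3f}")
```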
A simple rule of thumb is that, if you are estimating the percentage of people who prefer, say, coffee to tea for breakfast, and you ask a random sample from a population, then your margin of error (in %) is at most plus or minus 100 divided by the square root of the sample size. So for a survey of 1,000 people (the industry standard), the margin of error is generally quoted as ±3%.
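The rule is easy to check for the industry-standard sample of 1,000 (a minimal sketch; the usual justification is the worst case of a 50/50 split at a 95% confidence level):

```python
import math

n = 1000                                  # survey sample size
margin = 100 / math.sqrt(n)               # rule of thumb, in percentage points
print(f"rule of thumb:         +/- {margin:.1f}%")       # about 3.2%, usually quoted as 3%

# The rule comes from the worst case p = 50%: 2 * sqrt(0.5 * 0.5 / n), expressed in %
worst_case = 2 * math.sqrt(0.25 / n) * 100
print(f"worst-case 95% margin: +/- {worst_case:.1f}%")   # the same 100 / sqrt(n)
```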
My personal, rather sceptical heuristic is that any quoted margin of error in a poll should be doubled to allow for systematic errors made in the polling.
We might not expect complete accuracy in pre-election polls, but we would expect more from scientists trying to measure physical facts about the world such as the speed of light. But there is a long history of claimed margins of error from such experiments later being found to be hopelessly inadequate: in the first part of the twentieth century, the uncertainty intervals around the estimates of the speed of light did not include the current accepted value.
Margins of error should always be based on two components:
Type A: the standard statistical measures discussed in this chapter, which would be expected to reduce with more observations.
Type B: systematic errors that would not be expected to reduce with more observations, and have to be handled using non-statistical means such as expert judgement or external evidence.
The confidence intervals around the homicide counts in Figure 9.4 are of a totally different nature to margins of error around, say, unemployment figures. The latter are an expression of our epistemic uncertainty about the actual number of people unemployed, while the intervals around homicide counts are not expressing uncertainty about the actual number of homicides – we assume these have been correctly counted – but the underlying risks in society. These two types of interval may look similar, and even use similar mathematics, but they have fundamentally different interpretations.
A hypothesis can be defined as a proposed explanation for a phenomenon. It is not the absolute truth, but a provisional, working assumption, perhaps best thought of as a potential suspect in a criminal case.
observation = deterministic model + residual error.
Within statistical science, a hypothesis is considered to be a particular assumption about one of these components of a statistical model, with the connotation of being provisional, rather than ‘the truth’.
There must be a way of protecting us against false discoveries, and hypothesis testing attempts to fill that role.
The idea of a null hypothesis now becomes central: it is the simplified form of statistical model that we are going to work with until we have sufficient evidence against it.
The null hypothesis is what we are willing to assume is the case until proven otherwise. It is relentlessly negative, denying all progress and change.
So we can never claim that the null hypothesis has been actually proved: in the words of another great British statistician, Ronald Fisher, ‘the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.’
Similarly we shall find that we may reject the null hypothesis, but if we don’t have sufficient evidence to do so, it does not mean that we can accept it as truth.
In this case, the null hypothesis would be that there is truly no association whatsoever between arm-crossing and gender, in which case we would expect the observed difference in proportions between genders to be 0%.
In more complex situations it is not so straightforward to work out whether the data is compatible with the null hypothesis, but the following permutation test illustrates a powerful procedure that avoids complex mathematics.
An alternative approach, if we had rather a lot of time, would be to systematically work through all the possible permutations of the arm-crossing tickets, rather than just doing 1,000 simulations.
Fortunately we don’t have to perform these calculations since the probability distribution for the observed difference in proportions under the null hypothesis can be worked out in theory, and is shown in Figure 10.2(b) – it is based on what is known as the hypergeometric distribution, which gives the probability for a particular cell in the table taking on each possible value under random permutations.
We need a measure to summarize how close to the centre our observed value lies, and one summary is the ‘tail-area’ to the right of the dashed line shown in Figure 10.2, which is 45% or 0.45.
A P-value is the probability of getting a result at least as extreme as we did, if the null hypothesis (and all other modelling assumptions) were really true.
But an observed proportion in favour of males would also have led us to suspect the null hypothesis did not hold. We should therefore also calculate the chance of getting an observed difference of at least 7%, in either direction. This is known as a two-tailed P-value, corresponding to a two-sided test. This total tail area turns out to be 0.89, and since this value is near one it indicates that the observed value is near the centre of the null distribution.
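A permutation test along these lines is easy to code (a sketch only – the counts below are invented for illustration and give an observed difference of about 7%, but they are not the book's arm-crossing data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: 1 = left arm on top, 0 = right arm on top
females = np.array([1] * 26 + [0] * 28)      # 54 females, about 48% left-on-top
males   = np.array([1] * 22 + [0] * 32)      # 54 males,   about 41% left-on-top
observed = females.mean() - males.mean()     # observed difference, roughly 7%

pooled = np.concatenate([females, males])
diffs = []
for _ in range(10_000):
    rng.shuffle(pooled)                      # randomly re-deal the 'tickets' to the two groups
    diffs.append(pooled[:54].mean() - pooled[54:].mean())
diffs = np.array(diffs)

# Two-sided P-value: chance of a difference at least this extreme, in either direction
p_two_sided = np.mean(np.abs(diffs) >= abs(observed))
print(f"observed difference: {observed:.1%}, two-sided P-value: {p_two_sided:.2f}")
```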
The idea of statistical significance is straightforward: if a P-value is small enough, then we say the results are statistically significant.
1. Set up a question in terms of a null hypothesis that we want to check. This is generally given the notation H0.
2. Choose a test statistic that estimates something that, if it turned out to be extreme enough, would lead us to doubt the null hypothesis (often larger values of the statistic indicate incompatibility with the null hypothesis).
3. Generate the sampling distribution of this test statistic, were the null hypothesis true.
4. Check whether our observed statistic lies in the tails of this distribution and summarize this by the P-value: the probability, were the null hypothesis true, of observing such an extreme statistic. ‘Extreme’ has to be defined carefully – if, say, both large positive and large negative values of the test statistic would have been considered incompatible with the null hypothesis, then both tail-areas must contribute to the P-value.
5. Declare the result statistically significant if the P-value is below some critical threshold.
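A minimal end-to-end illustration of these steps (a made-up coin-fairness example, not one taken from the book):

```python
from scipy.stats import binom

# Step 1: null hypothesis H0 - the coin is fair, P(heads) = 0.5
# Step 2: test statistic - the number of heads in n flips
n, observed_heads = 100, 61

# Step 3: under H0 the sampling distribution of the statistic is Binomial(n, 0.5)
# Step 4: two-sided P-value - probability of a count at least this far from n/2
distance = abs(observed_heads - n / 2)
p_value = binom.cdf(n / 2 - distance, n, 0.5) + binom.sf(n / 2 + distance - 1, n, 0.5)

# Step 5: compare the P-value with a conventional threshold such as 0.05
print(f"P-value = {p_value:.3f}, statistically significant at 5%: {p_value < 0.05}")
```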
This whole process has become known as Null Hypothesis Significance Testing (NHST) and, as we shall see below, it has become a source of major controversy.
Perhaps the most challenging component in null-hypothesis significance testing is Step 3 – establishing the distribution of the chosen test statistic under the null hypothesis.
Often we make use of approximations that were developed by the pioneers of statistical inference. For example, around 1900 Karl Pearson developed a series of statistics for testing associations in cross-tabulations such as Table 10.1, out of which grew the classic chi-squared test of association.
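A sketch of such a test on a cross-tabulation, using SciPy's implementation of the chi-squared test of association (the 2×2 table is illustrative, not taken from the book):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative 2x2 cross-tabulation: rows = gender, columns = which arm is on top
table = np.array([[26, 28],
                  [22, 32]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, degrees of freedom = {dof}, P-value = {p_value:.2f}")
# 'expected' holds the counts we would expect under the null hypothesis of no association
```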
The development and use of test statistics and P-values has traditionally formed much of a standard statistics course, and has unfortunately given the field a reputation for being largely about picking the right formula and using the right tables.
This intimate link between hypothesis testing and confidence intervals should stop people misinterpreting results that are not statistically significantly different from 0 – this does not mean that the null hypothesis is actually true, but simply that a confidence interval for the true value includes 0. Unfortunately, as we shall see later, this lesson is often ignored.
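The link can be checked directly in a simple one-sample setting (an assumed example with invented data): the 95% confidence interval excludes 0 exactly when the two-sided P-value falls below 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(0.2, 1.0, 50)                # invented sample with a small true mean
n = len(x)

estimate = x.mean()
se = x.std(ddof=1) / np.sqrt(n)             # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = estimate - t_crit * se, estimate + t_crit * se

t_stat, p_value = stats.ttest_1samp(x, popmean=0)

print(f"95% CI: {ci_low:.3f} to {ci_high:.3f}   two-sided P-value: {p_value:.3f}")
print("CI excludes 0:", not (ci_low <= 0 <= ci_high))
print("P-value < 0.05:", p_value < 0.05)    # the two answers always agree
```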
Statistical significance is a claim about the quality of the evidence against the null hypothesis, and not a claim about the truthfulness of the null hypothesis itself.
In Chapter 5 we demonstrated a multiple linear regression with son’s height as the response (dependent) variable, and mother’s and father’s height as explanatory (independent) variables.
The t-value, also known as a t-statistic, is a major focus of attention, since it is the link that tells us whether the association between an explanatory variable and the response is statistically significant.
The t-value is simply the estimate/standard error (this can be checked for the numbers in Table 10.5), and so can be interpreted as how far the estimate is away from 0, measured in the number of standard errors.
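That relationship can be seen directly in software output; a sketch with statsmodels, using simulated heights as a stand-in for the data behind Table 10.5:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Simulated stand-in: son's height regressed on mother's and father's heights
n = 500
mother = rng.normal(64, 2.4, n)
father = rng.normal(69, 2.6, n)
son = 23 + 0.33 * mother + 0.40 * father + rng.normal(0, 2.3, n)

X = sm.add_constant(np.column_stack([mother, father]))
fit = sm.OLS(son, X).fit()

# The t-value for each coefficient is simply the estimate divided by its standard error
print(fit.params / fit.bse)   # matches...
print(fit.tvalues)            # ...the t-statistics reported by the software
```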