The Art of Statistics: Learning from Data
Kindle Notes & Highlights
Read between May 7 and May 10, 2020
9%
Someone who had been declared dead in Alabama could, at least in principle, cease to be legally dead were they moved across the state border into Florida, where the registration must be made by two qualified doctors.
9%
Then in the twentieth century statistics became more mathematical and, unfortunately for many students and practitioners, the topic became synonymous with the mechanical application of a bag of statistical tools, many named after eccentric and argumentative statisticians that we shall meet later in this book.
9%
The inappropriate use of standard statistical methods has received a fair share of the blame for what has become known as the reproducibility or replication crisis in science.
9%
For example, intensive analysis of data sets derived from routine data can increase the possibility of false discoveries, both from systematic bias inherent in the data sources and from carrying out many analyses and only reporting whatever looks most interesting, a practice sometimes known as ‘data-dredging’.
9%
But improving data literacy means changing the way statistics is taught.
9%
Fortunately this is changing. The needs of data science and data literacy demand a more problem-driven approach, in which the application of specific statistical tools is seen as just one component of a complete cycle of investigation. The PPDAC structure has been suggested as a way of representing a problem-solving cycle, which we shall adopt throughout this book.
9%
Unfortunately, in the rush to get data and start analysis, attention to design is often glossed over.
10%
The Analysis stage has traditionally been the main emphasis of statistics courses, and we shall cover a range of analytic techniques in this book; but sometimes all that is required is a useful visualization, as in Figure 0.1.
10%
Any conclusions generally raise more questions, and so the cycle starts over again, as when we started looking at the time of day when Shipman’s patients died.
11%
Bristol was awash with data, but none of the data sources could be considered the ‘truth’, and nobody had taken responsibility for analysing and acting on the surgical outcomes.
11%
Data that records whether individual events have happened or not is known as binary data, as it can only take on two values, generally labelled as yes and no.
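As a minimal sketch (with made-up outcomes, not the book's data), binary data is naturally summarized as a proportion; in Python:

    # Binary data: each observation takes one of two values (1 = "yes", 0 = "no").
    outcomes = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]   # did each event happen?
    proportion = sum(outcomes) / len(outcomes)  # the natural one-number summary
    print(f"Proportion of 'yes' outcomes: {proportion:.0%}")  # 20%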
11%
The theme of this chapter is that the basic presentation of statistics is important. In a sense we are jumping to the last step of the PPDAC cycle in which conclusions are communicated, and while the form of this communication has not traditionally been considered an important topic in statistics, rising interest in data visualization reflects a change in this attitude.
12%
The order of the rows of a table also needs to be considered carefully. Table 1.1 shows the hospitals in order of the number of operations in each, but if they had been presented, say, in order of mortality rates with the highest at the top of the table, this might give the impression that this was a valid and important way of comparing hospitals. Such league tables are favoured by the media and even some politicians, but can be grossly misleading: not only because the differences could be due to chance variation, but because the hospitals may be taking in very different types of cases.
12%
But the oldest trick of misleading graphics is to start the axis at, say, 95%, which will make the hospitals look extremely different, even if the variation is in fact attributable to chance alone.
13%
When it comes to presenting categorical data, pie charts allow an impression of the size of each category relative to the whole pie, but are often visually confusing, especially if they attempt to show too many categories in the same chart, or use a three-dimensional representation that distorts areas.
13%
We’re all familiar with hyperbolic media headlines that warn us that something mundane increases the risk of some dread occurrence: I like to call these ‘cats cause cancer’ stories.
13%
The figure of 18% is known as a relative risk since it represents the increase in risk of getting bowel cancer between a group of people who eat 50g of processed meat a day, which could, for example, represent a daily two-rasher bacon sandwich, and a group who don’t. Statistical commentators took this relative risk and reframed it into a change in absolute risk, which means the change in the actual proportion in each group who would be expected to suffer the adverse event.
13%
That is one extra case of bowel cancer in all those 100 lifetime bacon-eaters, which does not sound so impressive as the relative risk (an 18% increase), and might serve to put this hazard into perspective. We need to distinguish what is actually dangerous from what sounds frightening.
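As a sketch of the arithmetic behind that reframing: the 18% figure is from the text, while the roughly 6-in-100 baseline lifetime risk is an assumption for illustration.

    # Relative risk -> absolute risk, assuming a ~6% baseline lifetime risk.
    baseline_risk = 0.06   # lifetime bowel-cancer risk without daily bacon (assumed)
    relative_risk = 1.18   # the reported 18% relative increase
    exposed_risk = baseline_risk * relative_risk
    print(f"Absolute risk: {baseline_risk:.1%} -> {exposed_risk:.1%}")  # 6.0% -> 7.1%
    print(f"Extra cases per 100 eaters: {100 * (exposed_risk - baseline_risk):.1f}")  # about 1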
13%
Although extremely common in the research literature, odds ratios are a rather unintuitive way to summarize differences in risk. If the events are fairly rare then the odds ratios will be numerically close to the relative risks, as in the case of bacon sandwiches, but for common events the odds ratio can be very different from the relative risk, and the following example shows this can be very confusing for journalists (and others).
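A small sketch, with illustrative numbers rather than the book's example, of why the two measures agree for rare events but diverge for common ones:

    # Odds ratio vs relative risk: the odds of probability p are p / (1 - p).
    def odds(p):
        return p / (1 - p)

    for p_control, p_exposed in [(0.01, 0.02), (0.40, 0.80)]:
        rr = p_exposed / p_control
        odds_ratio = odds(p_exposed) / odds(p_control)
        print(f"risk {p_control:.0%} -> {p_exposed:.0%}: "
              f"relative risk {rr:.2f}, odds ratio {odds_ratio:.2f}")
    # rare event:   relative risk 2.00, odds ratio 2.02 (nearly identical)
    # common event: relative risk 2.00, odds ratio 6.00 (wildly different)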
14%
This highlights the danger of using odds ratios in anything but a scientific context, and the advantage of always reporting absolute risks as the quantity that is relevant for an audience, whether they are concerned with bacon, statins or anything else.
14%
Galton carried out what we might now call a data summary: he took a mass of numbers written on tickets and reduced them to a single estimated weight of 1,207 lb.
14%
We will begin with my own attempt at a wisdom-of-crowds experiment, which demonstrates many of the problems that crop up when the real, undisciplined world, with all its capacity for oddity and error, is used as a source of data.
14%
To start, Figure 2.2 shows three ways of presenting the pattern of the values the 915 respondents provided: these patterns can be variously termed the data distribution, sample distribution or empirical distribution.
15%
There is no ‘correct’ way to display sets of numbers, and each of the plots we have used has its advantages: strip-charts show individual points; box-and-whisker plots are convenient for rapid visual summaries; and histograms give a good feel for the underlying shape of the data distribution.
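For those who want to reproduce the comparison, a minimal matplotlib sketch using simulated skewed data in place of the book's 915 survey responses:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    values = rng.lognormal(mean=2.0, sigma=0.8, size=915)  # skewed stand-in data

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].plot(values, np.zeros_like(values), "|", alpha=0.3)  # strip-chart: every point
    axes[0].set_title("Strip-chart")
    axes[1].boxplot(values, vert=False)                          # box-and-whisker: quick summary
    axes[1].set_title("Box-and-whisker")
    axes[2].hist(values, bins=30)                                # histogram: shape of the distribution
    axes[2].set_title("Histogram")
    plt.tight_layout()
    plt.show()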
15%
But it is not just for legs and testicles that mean-averages can be inappropriate. The mean number of reported sexual partners, and the mean income in a country, may both bear little resemblance to most people’s experience. This is because means are unduly influenced by a few extremely high values which drag up the total.
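A quick sketch with made-up incomes makes the point: one extreme value drags the mean far above anything typical, while the median stays put.

    import statistics

    incomes = [20_000, 25_000, 30_000, 35_000, 40_000, 5_000_000]
    print(statistics.mean(incomes))    # 858333.33...: resembles nobody's experience
    print(statistics.median(incomes))  # 32500.0: close to the typical person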
15%
I can almost guarantee that, compared with people of your age and sex, you have far less than the average (mean) risk of dying next year. For example, the UK life tables report that 1% of 63-year-old men die each year before their 64th birthday, but many of those who will die are already seriously ill, and so the vast majority who are reasonably healthy will have less than this average risk.
16%
It is not enough to give a single summary for a distribution – we need to have an idea of the spread, sometimes known as the variability. For example, knowing the average adult male shoe size will not help a shoe firm decide the quantities of each size to make. One size does not fit all, a fact which is vividly illustrated by the seats for passengers in planes.
16%
Finally, the standard deviation is a widely used measure of spread. It is the most technically complex measure, but is only really appropriate for well-behaved symmetric data since it is also unduly influenced by outlying values.
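For reference, the standard deviation of observations $x_1, \ldots, x_n$ with mean $\bar{x}$ is

    $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

The squared deviations explain the sensitivity to outliers: a single extreme value contributes disproportionately to the sum.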
16%
This demonstrates that data often has some errors, outliers and other strange values, but these do not necessarily need to be individually identified and excluded. It also points to the benefits of using summary measures that are not unduly affected by odd observations such as 31,337 – these are known as robust measures, and include the median and the inter-quartile range.
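A sketch of that robustness, re-using the odd value 31,337 among otherwise invented responses:

    import statistics

    responses = [1, 2, 3, 3, 4, 5, 8, 31_337]
    print(statistics.mean(responses))    # ~3920: wrecked by the single outlier
    print(statistics.stdev(responses))   # ~11000: likewise
    print(statistics.median(responses))  # 3.5: barely notices it
    q1, _, q3 = statistics.quantiles(responses, n=4)
    print(q3 - q1)                       # the inter-quartile range is robust too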
16%
This is also reflected by the substantial difference between the means and the medians, which is a telling sign of data distributions with long right-hand tails.
17%
Large collections of numerical data are routinely summarized and communicated using a few statistics of location and spread, and the sexual-partner example has shown that these can take us a long way in grasping an overall pattern. However, there is no substitute for simply looking at data properly, and the next example shows that a good visualization is particularly valuable when we want to grasp the pattern in a large and complex set of numbers.
17%
It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient, an idea originally proposed by Francis Galton but formally published in 1895 by Karl Pearson, one of the founders of modern statistics.
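For reference, for paired observations $(x_i, y_i)$ the Pearson correlation coefficient is

    $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}$

It always lies between −1 and 1, reaching those extremes only when the points fall exactly on a straight line.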
17%
An alternative measure is called Spearman’s rank correlation after the English psychologist Charles Spearman (who developed the idea of an underlying general intelligence), and depends only on the ranks of the data rather than their specific values. This means it can be near 1 or −1 if the points are close to a line that steadily increases or decreases, even if this line is not straight.
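A short sketch with made-up data shows the contrast on a steadily increasing but curved relationship:

    from scipy.stats import pearsonr, spearmanr

    x = [1, 2, 3, 4, 5, 6]
    y = [v ** 3 for v in x]  # monotonic, but far from a straight line

    r_pearson, _ = pearsonr(x, y)
    r_spearman, _ = spearmanr(x, y)
    print(r_pearson)   # ~0.94: penalized for the curvature
    print(r_spearman)  # 1.0: the ranks agree perfectly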
17%
In many applications the x-axis represents a quantity known as the independent variable, and interest focuses on its influence on the dependent variable plotted on the y-axis. But, as we shall explore further in Chapter 4 on causation, this presupposes the direction in which the influence might lie.
18%
Like any good graphic, this raises more questions and encourages further exploration, both in terms of identifying individual countries, and of course examining projections of future trends.
18%
However, Alberto Cairo has identified four common features of a good data visualization:
1. It contains reliable information.
2. The design has been chosen so that relevant patterns become noticeable.
3. It is presented in an attractive manner, but appearance should not get in the way of honesty, clarity and depth.
4. When appropriate, it is organized in a way that enables some exploration.
19%
The second rule of communication is to know what you want to achieve. Hopefully the aim is to encourage open debate, and informed decision-making. But there seems no harm in repeating yet again that numbers do not speak for themselves; the context, language and graphic design all contribute to the way the communication is received.
19%
Going from our sample (Stage 2) to the study population (Stage 3) is perhaps the most challenging step. We first need to be confident that the people asked to take part in the survey are a random sample from those who are eligible: this should be fine for a well-organized study like Natsal. But we also need to assume that the people who actually agree to take part are representative, and this is less straightforward.
20%
But often the question goes beyond simple description of data: we want to learn something bigger than just the observations in front of us, whether it is to make predictions (how many will come next year?), or say something more basic (why are the numbers increasing?).
20%
Once we want to start generalizing from the data – learning something about the world outside our immediate observations – we need to ask ourselves the question, ‘Learn about what?’ And this requires us to confront the challenging idea of inductive inference.
20%
The crucial distinction is that deduction is logically certain, whereas induction is generally uncertain.
20%
Of course, it would be ideal if we could go straight from looking at the raw data to making general claims about the target population. In standard statistics courses, observations are assumed to be drawn perfectly randomly and directly from the population of direct interest. But this is rarely the case in real life, and therefore we need to consider the entire process of going from raw data to our eventual target.
20%
We want our data to be:
- Reliable, in the sense of having low variability from occasion to occasion, and so being a precise or repeatable number.
- Valid, in the sense of measuring what you really want to measure, and not having a systematic bias.
20%
So when framed in terms of a more risky liberalization, the proposal was opposed by the majority, a reversal in opinion brought about by simple rewording of the question.2 The responses to questions can also be influenced by what has been asked beforehand, a process known as priming.
20%
Going from sample (Stage 2) to study population (Stage 3): this depends on the fundamental quality of the study, also known as its internal validity: does the sample we observe accurately reflect what is going on in the group we are actually studying?
21%
The idea of adequate ‘stirring’ is crucial: if you want to be able to generalize from the sample to the population, you need to make sure your sample is representative. Just having masses of data does not necessarily help guarantee a good sample and can even give false reassurance.
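A simulation sketch (invented numbers) of that false reassurance: a small random sample lands near the truth, while a vastly larger but biased one stays wrong.

    import numpy as np

    rng = np.random.default_rng(1)
    population = rng.normal(loc=50, scale=10, size=1_000_000)

    random_sample = rng.choice(population, size=100)  # small but 'well stirred'
    biased_sample = np.sort(population)[-100_000:]    # huge, but only the top end

    print(f"Population mean:    {population.mean():.1f}")     # ~50
    print(f"Random sample mean: {random_sample.mean():.1f}")  # ~50, from just 100 values
    print(f"Biased sample mean: {biased_sample.mean():.1f}")  # far too high, despite 100,000 values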
21%
The major problem occurs when we want to claim that the data on the study population – people who reported crimes – represents the target population of all crimes committed in England and Wales. Unfortunately, police-recorded crime systematically misses cases which the police do not record as a crime or which have not been reported by the victim.
21%
Whole websites are dedicated to listing the possible biases that can occur in statistical science, from allocation bias (systematic differences in who gets each of two medical treatments being compared) to volunteer bias (people volunteering for studies being systematically different from the general population).
22%
We have already discussed the concept of a data distribution – the pattern the data makes, sometimes known as the empirical or sample distribution. Next we must tackle the concept of a population distribution – the pattern in the whole group of interest.
22%
The classic example is the ‘bell-shaped curve’, or normal distribution, first explored in detail by Carl Friedrich Gauss in 1809 in the context of measurement errors in astronomy and surveying.
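For reference, the normal distribution with mean $\mu$ and standard deviation $\sigma$ has the familiar bell-shaped density

    $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

where $\mu$ fixes the location of the peak and $\sigma$ its width.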