Kindle Notes & Highlights
Read between February 19 - June 10, 2020
We can think of this type of iterative, exploratory work as ‘forensic’ statistics.
To turn experience into data, we have to start with rigorous definitions.
These examples show that statistics are always to some extent constructed on the basis of judgements, and it would be an obvious delusion to think the full complexity of personal experience can be unambiguously coded and put into a spreadsheet or other software.
Data has two main limitations as a source of such knowledge.
First, it is almost always an imperfect measure of what we are really interested in.
Second, anything we choose to measure will differ from place to place, from person to person, from time to time, and the problem is to extract meaningful insights from all this apparently random variability.
Misleading findings can arise from systematic bias inherent in the data sources, and from carrying out many analyses and only reporting whatever looks most interesting, a practice sometimes known as ‘data-dredging’.
Data literacy describes the ability not only to carry out statistical analysis on real-world problems, but also to understand and critique any conclusions drawn by others on the basis of statistics.
The PPDAC structure (Problem, Plan, Data, Analysis, Conclusion) has been suggested as a way of representing a problem-solving cycle, which we shall adopt throughout this book (Figure 0.3).9
The first stage of the cycle is specifying a Problem; statistical inquiry always starts with a question.
It is tempting to skip over the need for a careful Plan.
Unfortunately, in the rush to get data and start analysis, attention to design is often glossed over.
Although in practice the PPDAC cycle laid out in Figure 0.3 may not be followed precisely, it underscores that formal techniques for statistical analysis play only one part in the work of a statistician or data scientist.
Summary
CHAPTER 1 Getting Things in Proportion: Categorical Data and Percentages
This is known as negative or positive framing, and its overall effect on how we feel is intuitive and well-documented: ‘5% mortality’ sounds worse than ‘95% survival’.
Note the two tricks used to manipulate the impact of this statistic: convert from a positive to a negative frame, and then turn a percentage into actual numbers of people.
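To make the arithmetic behind both tricks concrete, here is a minimal Python sketch using a hypothetical operation with 5% mortality; the figures are illustrative, not taken from the book:

mortality = 0.05  # hypothetical operation

# First trick: positive vs negative framing of the same statistic.
print(f"{1 - mortality:.0%} survival")   # positive frame: '95% survival'
print(f"{mortality:.0%} mortality")      # negative frame: '5% mortality'

# Second trick: turn the percentage into actual numbers of people.
patients = 100
print(f"{mortality * patients:.0f} out of {patients} patients die")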
Alberto Cairo, author of influential books on data visualization,3 suggests you should always begin with a ‘logical and meaningful baseline’.
Nate Silver, founder of the data-based platform FiveThirtyEight and first famous for accurately predicting the 2008 US presidential election, eloquently expressed the idea that numbers do not speak for themselves: we are responsible for giving them meaning.
We now need to introduce an important and convenient concept that will help us get beyond simple yes/no questions.
Categorical Variables
Categorical variables are measures that can take on two or more categories, which may be (see the code sketch after this list):
Unordered categories:
Ordered categories:
Numbers that have been grouped:
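A minimal pandas sketch of the three kinds; the example categories (blood groups, survey responses, age bands) are my own illustrations, not the book's:

import pandas as pd

# Unordered categories: labels with no natural ranking.
blood_group = pd.Categorical(["A", "O", "B", "O", "AB"], ordered=False)

# Ordered categories: labels with a natural ranking.
rating = pd.Categorical(
    ["agree", "neutral", "disagree", "agree"],
    categories=["disagree", "neutral", "agree"],
    ordered=True,
)

# Numbers that have been grouped: continuous values binned into bands.
ages = pd.Series([23, 37, 45, 61, 18])
age_band = pd.cut(ages, bins=[0, 30, 50, 100], labels=["under 30", "30-49", "50+"])

print(blood_group, rating, age_band, sep="\n")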
Comparing a Pair of Proportions
We need to distinguish what is actually dangerous from what sounds frightening.5
Expected frequencies: instead of discussing percentages or probabilities, we just ask, ‘What does this mean for 100 (or 1,000) people?’ Psychological studies have shown that this technique improves understanding.
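A minimal sketch of the conversion; the 12% and 0.4% probabilities are illustrative, not from the book:

def expected_frequency(probability, out_of=100):
    """Re-express a probability as a count of people."""
    return f"about {round(probability * out_of)} out of {out_of} people"

print(expected_frequency(0.12))         # about 12 out of 100 people
print(expected_frequency(0.004, 1000))  # about 4 out of 1000 people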
Technically, the odds for an event is the ratio of the chance of the event happening to the chance of it not happening. For example, since, out of 100 non-bacon eaters, 6 would be expected to develop bowel cancer and 94 would not, the odds of cancer in this group is 6/94, sometimes referred to as ‘6 to 94’.
Odds ratios are a rather unintuitive way to summarize differences in risk.
This highlights the danger of using odds ratios in anything but a scientific context, and the advantage of always reporting absolute risks as the quantity that is relevant for an audience, whether they are concerned with bacon, statins or anything else.
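A Python sketch of these quantities for the bacon example: the 6 cases per 100 non-bacon-eaters comes from the passage above, while the figure of 7 per 100 for daily bacon eaters is an assumption based on the chapter's running example.

def odds(cases, total):
    # Odds = chance of the event happening / chance of it not happening.
    return cases / (total - cases)

# 6 of 100 non-bacon-eaters develop bowel cancer (from the text);
# 7 of 100 daily bacon eaters is assumed for illustration.
non_bacon = odds(6, 100)   # 6/94, i.e. '6 to 94'
bacon = odds(7, 100)       # 7/93

print(f"odds ratio:        {bacon / non_bacon:.2f}")      # ~1.18
print(f"relative risk:     {(7 / 100) / (6 / 100):.2f}")  # ~1.17
print(f"absolute increase: {7 / 100 - 6 / 100:.0%}")      # 1%

Under these assumed figures, the absolute increase of one extra case per 100 people is the absolute risk that the passage above recommends reporting.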
Summary
CHAPTER 2 Summarizing and Communicating Numbers. Lots of Numbers
Figure 2.2 shows three ways of presenting the pattern of the values the 915 respondents provided: these patterns can be variously termed the data distribution, sample distribution or empirical distribution.*
Variables which are recorded as numbers come in different varieties (see the sketch after this list):
Count variables: where measurements are restricted to the integers 0, 1, 2…
Continuous variables: measurements that can be made, at least in principle, to arbitrary precision.
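As a rough illustration (the values are invented, not the book's 915 survey responses), the two varieties look like this in Python, and even a crude tally of a count variable displays its empirical distribution:

# Count variable: restricted to the integers 0, 1, 2, ...
goals_per_match = [0, 1, 2, 1, 3, 0, 2, 1, 1]

# Continuous variable: arbitrary precision, at least in principle.
heights_cm = [162.3, 175.1, 158.9, 181.4]

# A crude text histogram of the count variable's data distribution.
for value in sorted(set(goals_per_match)):
    print(value, "#" * goals_per_match.count(value))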
There are three basic interpretations of the term ‘average’, sometimes jokingly referred to by the single term ‘mean-median-mode’: the mean (the sum of the values divided by how many there are), the median (the middle value when they are put in order), and the mode (the most common value).
Interpreting the term ‘average’ as the mean-average gives rise to the old jokes about nearly everyone having greater than the average number of legs (which is presumably around 1.99999), and about people having on average one testicle.
Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value. It might help to distinguish between ‘average income’ (the mean) and ‘the income of the average person’ (the median).
Is this the average-house price (that is, the median)? Or the average house-price (that is, the mean)? A hyphen can make a big difference.
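A minimal sketch of the distinction, using invented house prices in £1,000s:

import statistics

prices = [120, 150, 165, 180, 210, 240, 1500]  # one mansion skews the data

print(statistics.mean(prices))    # ~366: the average house-price, pulled up by the outlier
print(statistics.median(prices))  # 180: the average-house price, the middle value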
45% of people guessed below 1,616, and 55% guessed above, so there was little systematic tendency for the guesses to be either on the high or low side—we say the true value lay at the 45th percentile of the empirical data distribution.
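A percentile rank is simply the proportion of values falling below a given point; here is a sketch with invented guesses (not the 915 from the book):

guesses = [900, 1200, 1300, 1450, 1500, 1600, 1700, 2000, 2400, 3000]
true_value = 1616

below = sum(g < true_value for g in guesses)             # 6 of these 10 guesses
print(f"{100 * below / len(guesses):.0f}th percentile")  # 60th percentile here
# In the book's data, 45% of guesses fell below 1,616: the 45th percentile.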
Describing the Spread of a Data Distribution
It is not enough to give a single summary for a distribution—we need to have an idea of the spread, sometimes known as the variability.
The range is a natural choice, but is clearly very sensitive to extreme values; the inter-quartile range (IQR), in contrast, is unaffected by extremes.
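A sketch contrasting the two measures, reusing the invented house prices from above:

import statistics

prices = [120, 150, 165, 180, 210, 240, 1500]

print(max(prices) - min(prices))  # range = 1380, dominated by the one extreme value

# statistics.quantiles returns the three quartile cut points for n=4;
# the IQR spans the central half of the data.
q1, _, q3 = statistics.quantiles(prices, n=4)
print(q3 - q1)                    # IQR = 90, unaffected by the outlier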