The Art of Statistics: How to Learn from Data
Read between February 19 - June 10, 2020
We can think of this type of iterative, exploratory work as ‘forensic’ statistics,
to turn experience into data, we have to start with rigorous definitions.
These examples show that statistics are always to some extent constructed on the basis of judgements, and it would be an obvious delusion to think the full complexity of personal experience can be unambiguously coded and put into a spreadsheet or other software.
Data has two main limitations as a source of such knowledge.
First, it is almost always an imperfect measure of what we are really interested in:
Second, anything we choose to measure will differ from place to place, from person to person, from time to time, and the problem is to extract meaningful insights from all this apparently random variability.
systematic bias inherent in the data sources and from carrying out many analyses and only reporting whatever looks most interesting, a practice sometimes known as ‘data-dredging’.
data literacy, which describes the ability to not only carry out statistical analysis on real-world problems, but also to understand and critique any conclusions drawn by others on the basis of statistics.
The PPDAC structure has been suggested as a way of representing a problem-solving cycle, which we shall adopt throughout this book.9 Figure 0.3
The first stage of the cycle is specifying a Problem; statistical inquiry always starts with a question,
It is tempting to skip over the need for a careful Plan.
Unfortunately, in the rush to get data and start analysis, attention to design is often glossed over.
Although in practice the PPDAC cycle laid out in Figure 0.3 may not be followed precisely, it underscores that formal techniques for statistical analysis play only one part in the work of a statistician or data scientist.
Summary
CHAPTER 1 Getting Things in Proportion: Categorical Data and Percentages
This is known as negative or positive framing, and its overall effect on how we feel is intuitive and well-documented: ‘5% mortality’ sounds worse than ‘95% survival’.
Note the two tricks used to manipulate the impact of this statistic: convert from a positive to a negative frame, and then turn a percentage into actual numbers of people.
Alberto Cairo, author of influential books on data visualization,3 suggests you should always begin with a ‘logical and meaningful baseline’,
Nate Silver, the founder of data-based platform FiveThirtyEight and first famous for accurately predicting the 2008 US presidential election, who eloquently expressed the idea that numbers do not speak for themselves—we are responsible for giving them meaning.
We now need to introduce an important and convenient concept that will help us get beyond simple yes/no questions.
Categorical Variables
Categorical variables are measures that can take on two or more categories, which may be
Unordered categories:
Ordered categories:
Numbers that have been grouped:
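As a rough illustration of the third kind of categorical variable, numbers that have been grouped, the sketch below maps a numeric age onto ordered bands. The band edges, the labels and the name `age_band` are assumptions made up for this example, not taken from the book.

```python
from bisect import bisect_right

# Hypothetical band edges and labels, chosen only for illustration.
EDGES = [18, 40, 65]
LABELS = ["under 18", "18-39", "40-64", "65 and over"]


def age_band(age: float) -> str:
    """Map a numeric age onto one of the ordered bands in LABELS."""
    # bisect_right counts how many edges are <= age, which is the band index.
    return LABELS[bisect_right(EDGES, age)]


ages = [3, 18, 39.5, 40, 64, 65, 90]
bands = [age_band(a) for a in ages]
for age, band in zip(ages, bands):
    print(f"{age}: {band}")
```

The bands keep their order (each band covers larger values than the one before), which is what separates grouped numbers from unordered categories.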
Comparing a Pair of Proportions
We need to distinguish what is actually dangerous from what sounds frightening.5
expected frequencies: instead of discussing percentages or probabilities, we just ask, ‘What does this mean for 100 (or 1,000) people?’ Psychological studies have shown that this technique improves understanding:
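The idea of expected frequencies can be sketched in a few lines: take a probability and ask how many people out of 100 (or 1,000) it corresponds to. The helper name `expected_frequency` is hypothetical, and the 6% figure is borrowed from the non-bacon-eater example quoted elsewhere in these highlights.

```python
def expected_frequency(probability: float, group_size: int = 100) -> float:
    """Return how many people in a group of group_size we expect to be affected."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * group_size


risk = 0.06  # a 6% risk
for size in (100, 1000):
    people = expected_frequency(risk, size)
    print(f"About {people:.0f} out of {size} people")
```

Reading "6%" as "about 6 out of 100 people" carries the same information, but as a count of people rather than a probability.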
Technically, the odds for an event is the ratio of the chance of the event happening to the chance of it not happening. For example, since, out of 100 non-bacon eaters, 6
cancer in this group is 6/94, sometimes referred to as ‘6 to 94’.
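The definition of odds can be checked with a small sketch that uses the figures quoted above (6 cases among 100 non-bacon eaters, so 94 non-cases). The function name `odds` is an assumption for this example. Exact fractions are used so that the "6 to 94" form stays visible.

```python
from fractions import Fraction


def odds(events: int, total: int) -> Fraction:
    """Odds = chance of the event / chance of no event = events / non-events."""
    if not 0 <= events < total:
        raise ValueError("need 0 <= events < total")
    return Fraction(events, total - events)


non_bacon_odds = odds(6, 100)   # 6 to 94, which reduces to 3 to 47
risk = Fraction(6, 100)         # the corresponding risk (probability)

print(f"odds = {non_bacon_odds} (about {float(non_bacon_odds):.3f})")
print(f"risk = {risk} (about {float(risk):.3f})")
```

For a rare event the odds and the risk are close (roughly 0.064 against 0.06 here). They move apart as the event becomes more common.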
odds ratios.
odds ratios are a rather unintuitive way to summarize differences in risk.
This highlights the danger of using odds ratios in anything but a scientific context, and the advantage of always reporting absolute risks as the quantity that is relevant for an audience, whether they are concerned with bacon, statins or anything else.
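To see why odds ratios can mislead outside a scientific context, the sketch below compares three summaries of the same change in risk: the absolute difference, the relative risk and the odds ratio. The helper `compare_risks` and the comparison counts (7 per 100, and 85 against 87 per 100) are hypothetical and chosen only for illustration. Only the baseline of 6 per 100 comes from the highlights above.

```python
from fractions import Fraction


def compare_risks(events_a: int, events_b: int, group_size: int = 100) -> dict:
    """Summarize the change from group A to group B in three different ways."""
    if not (0 < events_a < group_size and 0 < events_b < group_size):
        raise ValueError("event counts must lie strictly between 0 and group_size")
    risk_a = Fraction(events_a, group_size)
    risk_b = Fraction(events_b, group_size)
    odds_a = risk_a / (1 - risk_a)
    odds_b = risk_b / (1 - risk_b)
    return {
        "absolute_difference": risk_b - risk_a,
        "relative_risk": risk_b / risk_a,
        "odds_ratio": odds_b / odds_a,
    }


rare = compare_risks(6, 7)       # hypothetical: 6 vs 7 cases per 100
common = compare_risks(85, 87)   # hypothetical: 85 vs 87 cases per 100

for label, summary in (("rare event", rare), ("common event", common)):
    print(label)
    for name, value in summary.items():
        print(f"  {name}: {float(value):.3f}")
```

In both scenarios the odds ratio comes out near 1.18. The absolute risk rises by 1 person in 100 in the first scenario and by 2 in 100 in the second, and for the common event the relative risk is only about 1.02. An odds ratio reported on its own could therefore suggest a much larger change than the audience actually faces.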
Summary
CHAPTER 2 Summarizing and Communicating Numbers. Lots of Numbers
Figure 2.2 shows three ways of presenting the pattern of the values the 915 respondents provided: these patterns can be variously termed the data distribution, sample distribution or empirical distribution.*
Variables which are recorded as numbers come in different varieties:
Count variables: where measurements are restricted to the integers 0, 1, 2…
Continuous variables: measurements that can be made, at least in principle, to arbitrary precision.
There are three basic interpretations of the term ‘average’, sometimes jokingly referred to by the single term ‘mean-median-mode’:
Interpreting the term ‘average’ as the mean-average gives rise to the old jokes about nearly everyone having greater than the average number of legs (which is presumably around 1.99999),
people having on average one testicle.
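The leg joke can be reproduced with a small sketch that computes all three "averages" for a made-up population. The population used here (99,999 people with two legs and one person with one leg) is an assumption, picked so that the mean lands on the 1.99999 mentioned above.

```python
import statistics

# Hypothetical population: 99,999 people with two legs, one person with one leg.
legs = [2] * 99_999 + [1]

mean_legs = statistics.mean(legs)      # total legs divided by number of people
median_legs = statistics.median(legs)  # middle value when sorted
mode_legs = statistics.mode(legs)      # most common value

above_mean = sum(1 for n in legs if n > mean_legs) / len(legs)

print(f"mean = {mean_legs:.5f}, median = {median_legs}, mode = {mode_legs}")
print(f"share of people above the mean: {above_mean:.3%}")
```

The median and mode are both 2, while the mean sits just below 2, so nearly everyone in this population has more than the mean number of legs.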
Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value
might help to distinguish between ‘average income’ (mean) and ‘the income of the average person’ (median).
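A quick sketch of the difference between "average income" and "the income of the average person", using five invented incomes, one of which is very large. All figures are hypothetical.

```python
import statistics

# Hypothetical annual incomes; the last one is deliberately extreme.
incomes = [20_000, 25_000, 30_000, 35_000, 1_000_000]

average_income = statistics.mean(incomes)               # 'average income'
income_of_average_person = statistics.median(incomes)   # 'income of the average person'

print(f"mean   = {average_income:,.0f}")
print(f"median = {income_of_average_person:,.0f}")
```

With these numbers the mean is 222,000, more than anyone but the top earner receives, while the median is 30,000. That is the kind of asymmetric pattern in which a mean-average misleads.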
Is this the average-house price (that is, the median)? Or the average house-price (that is, the mean)? A hyphen can make a big difference.
45% of people guessed below 1,616, and 55% guessed above, so there was little systematic tendency for the guesses to be either on the high or low side—we say the true value lay at the 45th percentile of the empirical data distribution.
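The percentile statement can be made concrete with a sketch that measures what share of guesses fall below a target value. The 20 guesses below are invented so that 9 of them (45%) fall under the true value of 1,616. They are not the 915 real responses, and `percentile_rank` is a hypothetical helper name.

```python
def percentile_rank(values: list[float], target: float) -> float:
    """Percentage of values that lie strictly below target."""
    if not values:
        raise ValueError("values must not be empty")
    below = sum(1 for v in values if v < target)
    return 100 * below / len(values)


TRUE_VALUE = 1616

# Hypothetical guesses: 9 below the true value, 11 above it.
guesses = [
    400, 700, 900, 1000, 1200, 1300, 1400, 1500, 1600,
    1650, 1700, 1800, 2000, 2200, 2500, 3000, 4000, 6000, 9000, 12000,
]

rank = percentile_rank(guesses, TRUE_VALUE)
print(f"The true value lies at about the {rank:.0f}th percentile of the guesses")
```

A rank near 50 means the guesses split roughly evenly around the true value, which is what "little systematic tendency" to be high or low amounts to.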
Describing the Spread of a Data Distribution
It is not enough to give a single summary for a distribution—we need to have an idea of the spread, sometimes known as the variability.
The range is a natural choice, but is clearly very sensitive to extreme values
the inter-quartile range (IQR) is unaffected by extremes.
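The contrast between the range and the inter-quartile range can be sketched by replacing the largest value in a small data set with an extreme one. The seven guesses are hypothetical. The quartiles come from `statistics.quantiles` with its default method, which is one of several common conventions for computing them.

```python
import statistics


def value_range(data: list[float]) -> float:
    """Distance between the largest and smallest values."""
    return max(data) - min(data)


def iqr(data: list[float]) -> float:
    """Distance between the third and first quartiles."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return q3 - q1


# Hypothetical guesses, then the same guesses with the largest made extreme.
typical = [800, 1000, 1200, 1500, 1700, 2000, 2500]
with_extreme = [800, 1000, 1200, 1500, 1700, 2000, 25_000]

for label, data in (("typical", typical), ("with extreme", with_extreme)):
    print(f"{label}: range = {value_range(data)}, IQR = {iqr(data)}")
```

Here the range jumps from 1,700 to 24,200 while the IQR stays at 1,000, because the quartiles depend only on the central half of the sorted values.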