The Art of Statistics: How to Learn from Data
Read between February 19 - June 10, 2020
19%
volunteer bias
19%
The ‘Bell-Shaped Curve’
19%
We have already discussed the concept of a data distribution—the pattern the data makes, sometimes known as the empirical or sample distribution. Next we must tackle the concept of a population distribution—the pattern in the whole group of interest.
19%
The shape of this distribution is important.
19%
The classic example is the ‘bell-shaped curve’, or normal distribution, first explored in detail by Carl Friedrich Gauss in 1809 in the context of measurement errors in astronomy and surveying.*
19%
Theory shows that the normal distribution can be expected to occur for phenomena that are driven by large numbers of small influences,
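A minimal simulation sketch of this idea, assuming each "small influence" is an independent push drawn uniformly from [-1, 1] (an illustrative choice, not something the book specifies). The function names are hypothetical. If the sums behave approximately normally, roughly 68% of them should fall within one standard deviation of the mean and roughly 95% within two.

```python
import random
import statistics


def sum_of_small_influences(n_influences: int, rng: random.Random) -> float:
    """Add up many small, independent pushes, each uniform on [-1, 1]."""
    return sum(rng.uniform(-1.0, 1.0) for _ in range(n_influences))


def simulate(n_samples: int = 20_000, n_influences: int = 50, seed: int = 1):
    """Generate n_samples outcomes, each the sum of n_influences small pushes."""
    rng = random.Random(seed)
    return [sum_of_small_influences(n_influences, rng) for _ in range(n_samples)]


def share_within(values, k: float) -> float:
    """Fraction of values lying within k standard deviations of their mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)


samples = simulate()
within_one_sd = share_within(samples, 1)
within_two_sd = share_within(samples, 2)
print(f"within 1 SD: {within_one_sd:.3f}, within 2 SD: {within_two_sd:.3f}")
```

The uniform pushes are not themselves bell-shaped; the bell shape only appears once many of them are added together.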
19%
Figure 3.2
20%
The normal distribution is characterized by its mean, or expectation, and its standard deviation,
20%
difference is that terms such as mean and standard deviation are known as statistics when describing a set of data, and parameters when describing a population.
20%
A great advantage of assuming a normal form for a distribution is that many important quantities can be simply obtained from tables or software.
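As a sketch of what "obtained from software" can look like, Python's standard library provides `statistics.NormalDist`. The mean of 100 and standard deviation of 15 below are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical population: mean 100, standard deviation 15 (illustrative only).
population = NormalDist(mu=100, sigma=15)

# Proportion of the population below 130 (two standard deviations above the mean).
below_130 = population.cdf(130)

# Proportion lying within one standard deviation of the mean (85 to 115).
within_one_sd = population.cdf(115) - population.cdf(85)

# Value below which 90% of the population falls (the 90th percentile).
percentile_90 = population.inv_cdf(0.90)

print(f"{below_130:.4f} {within_one_sd:.4f} {percentile_90:.2f}")
```

These quantities depend only on the assumed mean and standard deviation, which is why a normal assumption makes them so easy to look up.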
20%
Z-score,
20%
which simply measures how many standard deviations a data-point is from the mean.
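A short sketch of that calculation, using made-up heights and the sample standard deviation (the choice of sample rather than population standard deviation is an assumption here; `z_scores` is a hypothetical helper name).

```python
import statistics


def z_scores(data):
    """Return, for each data-point, how many standard deviations it is from the mean."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("z-scores are undefined when all values are identical")
    return [(x - mean) / sd for x in data]


heights_cm = [160, 165, 170, 175, 180]  # hypothetical measurements
scores = z_scores(heights_cm)
print([round(z, 2) for z in scores])
```

A data-point equal to the mean gets a Z-score of 0, and points above or below the mean get positive or negative scores respectively.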
20%
The mean and standard deviation can be used as summary descriptions for (most) other distributions, but oth...
20%
quartiles,
20%
inter-quartile range,
20%
measure of the spread of the d...
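The quartiles and inter-quartile range named in the highlights above can be sketched with `statistics.quantiles`. The data are invented, and the `"inclusive"` interpolation method is one of several conventions, chosen here as an assumption. The example compares a data set with and without one extreme value to show how the inter-quartile range and the standard deviation respond.

```python
import statistics


def quartiles_and_iqr(data):
    """Return (Q1, median, Q3, IQR) using the 'inclusive' quantile method."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q1, q2, q3, q3 - q1


typical = [1, 2, 3, 4, 5, 6, 7, 8, 9]        # hypothetical values
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # same, but one extreme value

iqr_typical = quartiles_and_iqr(typical)[3]
iqr_outlier = quartiles_and_iqr(with_outlier)[3]
sd_typical = statistics.stdev(typical)
sd_outlier = statistics.stdev(with_outlier)

print(iqr_typical, iqr_outlier, round(sd_typical, 2), round(sd_outlier, 2))
```

In this made-up example the single extreme value leaves the inter-quartile range unchanged while the standard deviation grows several-fold.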
20%
So a population can be thought of as a physical group of individuals, but also as providing the probability distribution for a random observation. This dual interpretation will be fundamental when we come to more formal statistical inference.
20%
the whole point of this chapter is that we do not generally know about populations, and so want to follow the inductive process and go the other way around, from data to population.
20%
What Is the Population?
20%
There are three types of populations from which a sample might be drawn, whether the data come from people, transactions, trees, or anything else.
20%
A literal population.
20%
virtual population.
21%
metaphorical population,
21%
It should be apparent that rather few applications of statistical science actually involve literal random sampling, and that it is increasingly common to have all the data that is potentially available. Nevertheless it is extremely valuable to keep hold of the idea of an imaginary population from which our ‘sample’ is drawn, as then we can use all the mathematical techniques that have been developed for sampling from real populations.
21%
Summary
• Inductive inference requires working from our data, through study sample and study population, to a target population.
• Problems and biases can crop up at each stage of this path.
• The best way to proceed from sample to study population is to have drawn a random sample.
• A population can be thought of as a group of individuals, but also as providing the probability distribution for a random observation drawn from that population.
• Populations can be summarized using parameters that mirror the summary statistics of sample data.
• Often data does not arise as a sample from a ...
21%
CHAPTER 4 What Causes What?
21%
ascertainment bias
21%
‘Correlation Does Not Imply Causation’
22%
What Is ‘Causation’ Anyway?
22%
counter-factual.
22%
So we can never say that X caused Y in a specific case, only that X increases the proportion of times that Y happens.
22%
This has two vital consequences for what we have to do if we want to know what causes what. First, in order to infer causation with real confidence, we ideally need to intervene and perform experiments. Second, since this is a statistical or stochastic world, we need to intervene more than once in order to amass evidence.
22%
proper medical trial should ideally obey the following principles:
22%
1. Controls: If we want to investigate the effect of statins on a population, we can’t just give statins to a few people, and then, if they don’t have a heart attack, claim this was due to the pill (regardless of the websites that use this form of anecdotal reasoning to market their products). We need an intervention group, who will be given statins, and a control group who will be given sugar pills or placebos.
22%
2. Allocation of treatment: It is important to compare like with like, so the treatment and comparison groups ha...
22%
3. People should be counted in the groups to which they were allocated: The people allocated to the ‘statin’ group in the Heart Protection Study (HPS) were included in the final analysis even if they did not take their statins. This is known as the ‘intention to treat’ principle, and can seem rather odd. It means that the final estimate of the effect of statins really measures the effect of being prescribed statins rather than actually taking them.
23%
4. If possible, people should not even know which group they are in:
23%
blinded to the treatment
23%
5. Groups should be treated equally:
23%
6. If possible, those assessing the final outcomes should not know which group the subjects are in:
23%
7. Measure everyone:
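The 'intention to treat' principle in point 3 can be illustrated with a small sketch. The participant records below are entirely invented (they are not data from the Heart Protection Study), and `Participant`, `event_rate` and `intention_to_treat` are hypothetical names. The point is only that people are grouped by the arm they were allocated to, whether or not they actually took the pills.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    allocated: str      # arm assigned at the start: "statin" or "placebo"
    took_statin: bool   # whether they actually took statins (ignored by the analysis)
    had_event: bool     # whether the outcome of interest occurred


def event_rate(people) -> float:
    """Proportion of the given participants who had the event."""
    people = list(people)
    return sum(p.had_event for p in people) / len(people)


def intention_to_treat(participants):
    """Event rate per arm, grouping strictly by allocation, not by pills taken."""
    return {
        arm: event_rate(p for p in participants if p.allocated == arm)
        for arm in ("statin", "placebo")
    }


# Invented records: two people allocated to statins never took them.
participants = (
    [Participant("statin", True, False)] * 7
    + [Participant("statin", True, True)] * 1
    + [Participant("statin", False, False)] * 1
    + [Participant("statin", False, True)] * 1
    + [Participant("placebo", False, False)] * 7
    + [Participant("placebo", False, True)] * 3
)

itt_rates = intention_to_treat(participants)
print(itt_rates)
```

Because the two non-takers stay in the 'statin' arm, the comparison estimates the effect of being allocated statins rather than of actually taking them.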
24%
The main recent innovation in randomized experimentation concerns ‘A/B’ testing in web design, in which users are (unknowingly) directed to alternative layouts for web pages, and measurements made of time spent on pages, click-throughs to advertisements, and so on.
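A rough sketch of the mechanics of such a test, assuming a simple coin-flip allocation of users to two layouts labelled "A" and "B". The function names, user identifiers and click data are all hypothetical.

```python
import random


def assign_layout(user_ids, seed: int = 0):
    """Randomly direct each user to layout 'A' or 'B'."""
    rng = random.Random(seed)
    return {uid: rng.choice(("A", "B")) for uid in user_ids}


def click_through_rates(assignment, clicked):
    """Click-through rate per layout; `clicked` is the set of users who clicked."""
    shown = {"A": 0, "B": 0}
    clicks = {"A": 0, "B": 0}
    for uid, layout in assignment.items():
        shown[layout] += 1
        clicks[layout] += uid in clicked  # True counts as 1, False as 0
    return {layout: clicks[layout] / shown[layout] for layout in shown if shown[layout]}


# Tiny hand-made example: users 1 and 2 saw layout A, users 3 and 4 saw layout B.
example_rates = click_through_rates({1: "A", 2: "A", 3: "B", 4: "B"}, clicked={1, 3, 4})
print(example_rates)
```

Random allocation is what makes the two groups of users comparable, so a difference in click-through rate can be attributed to the layout rather than to who happened to see it.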
24%
What Do We Do When We Can’t Randomize?
24%
When the data does not arise from an experiment, it is said to be observational.
24%
What Can We Do When We Observe an Association?
24%
When an apparent association between two outcomes might be explained by some observed common factor that influences both, this common cause is known as a
24%
confounder:
24%
The simplest technique for dealing with confounders is to look at the apparent relationship within each level of the confounder. This is known as adjustment, or stratification.
25%
Simpson’s Paradox
25%
Simpson’s paradox, which occurs when the apparent direction of an association is reversed by adjusting for a confounding factor, requiring a complete change in the apparent lesson from the data.
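A small numerical sketch of the reversal, using invented counts (not the book's example). Two groups, "X" and "Y", are compared on a success rate, and the confounder is a two-level factor labelled "easy" or "hard". Looking within each level of the confounder is the stratification step; pooling ignores it.

```python
# Invented counts: (successes, total) for each group within each stratum.
counts = {
    "easy": {"X": (800, 1000), "Y": (85, 100)},
    "hard": {"X": (30, 100), "Y": (350, 1000)},
}


def stratified_rates(counts):
    """Success rate for each group within each level of the confounder."""
    return {
        stratum: {group: s / n for group, (s, n) in groups.items()}
        for stratum, groups in counts.items()
    }


def pooled_rates(counts):
    """Success rate for each group, ignoring the confounder entirely."""
    totals = {}
    for groups in counts.values():
        for group, (s, n) in groups.items():
            prev_s, prev_n = totals.get(group, (0, 0))
            totals[group] = (prev_s + s, prev_n + n)
    return {group: s / n for group, (s, n) in totals.items()}


within = stratified_rates(counts)
pooled = pooled_rates(counts)
print("within strata:", within)
print("pooled:", pooled)
```

In these made-up numbers Y does better than X inside both strata, yet X looks far better when the strata are pooled, because most of X's cases fall in the "easy" stratum.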