Kindle Notes & Highlights
Read between February 19 - June 10, 2020
volunteer bias
The ‘Bell-Shaped Curve’
We have already discussed the concept of a data distribution—the pattern the data makes, sometimes known as the empirical or sample distribution. Next we must tackle the concept of a population distribution—the pattern in the whole group of interest.
The shape of this distribution is important.
The classic example is the ‘bell-shaped curve’, or normal distribution, first explored in detail by Carl Friedrich Gauss in 1809 in the context of measurement errors in astronomy and surveying.
Theory shows that the normal distribution can be expected to occur for phenomena that are driven by large numbers of small influences.
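This idea can be checked directly in simulation. A minimal sketch (the influence count and ranges are invented for illustration): each observation is the sum of many small random influences, and the resulting distribution behaves like a bell-shaped curve.

```python
import random
import statistics

# Illustration: a quantity driven by many small independent influences
# tends toward a normal (bell-shaped) distribution.
random.seed(42)

def many_small_influences(n_influences=100):
    # each influence adds a small random amount in [-1, 1]
    return sum(random.uniform(-1, 1) for _ in range(n_influences))

values = [many_small_influences() for _ in range(10_000)]
mean = statistics.mean(values)
sd = statistics.stdev(values)

# For a normal distribution, roughly 95% of values lie within 2 sd of the mean
within_2sd = sum(abs(v - mean) <= 2 * sd for v in values) / len(values)
print(round(within_2sd, 2))
```

The printed proportion comes out close to the 95% expected for a normal distribution, even though each individual influence is uniform, not normal.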
Figure 3.2
The normal distribution is characterized by its mean, or expectation, and its standard deviation.
One difference is that terms such as mean and standard deviation are known as statistics when describing a set of data, and parameters when describing a population.
A great advantage of assuming a normal form for a distribution is that many important quantities can be simply obtained from tables or software.
Z-score, which simply measures how many standard deviations a data-point is from the mean.
The mean and standard deviation can be used as summary descriptions for (most) other distributions, but oth...
quartiles,
inter-quartile range,
measure of the spread of the d...
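A minimal sketch of quartiles and the inter-quartile range as a spread measure, using invented data with one extreme value; unlike the standard deviation, the IQR is barely affected by the outlier.

```python
import statistics

# Invented data with a deliberate outlier (40)
data = [3, 5, 7, 8, 9, 11, 13, 14, 15, 40]

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles (default 'exclusive' method)
iqr = q3 - q1                                 # inter-quartile range: spread of the middle half
print(q1, q2, q3, iqr)
print(round(statistics.stdev(data), 1))       # the outlier inflates the sd far more than the IQR
```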
So a population can be thought of as a physical group of individuals, but also as providing the probability distribution for a random observation. This dual interpretation will be fundamental when we come to more formal statistical inference.
The whole point of this chapter is that we do not generally know about populations, and so we want to follow the inductive process and go the other way around, from data to population.
What Is the Population?
There are three types of populations from which a sample might be drawn, whether the data come from people, transactions, trees, or anything else.
• A literal population.
• A virtual population.
• A metaphorical population.
It should be apparent that rather few applications of statistical science actually involve literal random sampling, and that it is increasingly common to have all the data that is potentially available. Nevertheless it is extremely valuable to keep hold of the idea of an imaginary population from which our ‘sample’ is drawn, as then we can use all the mathematical techniques that have been developed for sampling from real populations.
Summary
• Inductive inference requires working from our data, through study sample and study population, to a target population.
• Problems and biases can crop up at each stage of this path.
• The best way to proceed from sample to study population is to have drawn a random sample.
• A population can be thought of as a group of individuals, but also as providing the probability distribution for a random observation drawn from that population.
• Populations can be summarized using parameters that mirror the summary statistics of sample data.
• Often data does not arise as a sample from a…
CHAPTER 4 What Causes What?
ascertainment bias
‘Correlation Does Not Imply Causation’
What Is ‘Causation’ Anyway?
counter-factual.
So we can never say that X caused Y in a specific case, only that X increases the proportion of times that Y happens.
This has two vital consequences for what we have to do if we want to know what causes what. First, in order to infer causation with real confidence, we ideally need to intervene and perform experiments. Second, since this is a statistical or stochastic world, we need to intervene more than once in order to amass evidence.
A proper medical trial should ideally obey the following principles:
1. Controls: If we want to investigate the effect of statins on a population, we can’t just give statins to a few people, and then, if they don’t have a heart attack, claim this was due to the pill (regardless of the websites that use this form of anecdotal reasoning to market their products). We need an intervention group, who will be given statins, and a control group who will be given sugar pills or placebos.
2. Allocation of treatment: It is important to compare like with like, so the treatment and comparison groups ha...
3. People should be counted in the groups to which they were allocated: The people allocated to the ‘statin’ group in the Heart Protection Study (HPS) were included in the final analysis even if they did not take their statins. This is known as the ‘intention to treat’ principle, and can seem rather odd. It means that the final estimate of the effect of statins really measures the effect of being prescribed statins rather than actually taking them.
4. If possible, people should not even know which group they are in: blinded to the treatment.
5. Groups should be treated equally:
6. If possible, those assessing the final outcomes should not know which group the subjects are in:
7. Measure everyone:
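Several of the principles above, in particular random allocation, a control group, and 'intention to treat' counting, can be sketched in a small simulation. All the numbers here (baseline risk, treatment effect, adherence) are invented for illustration.

```python
import random

random.seed(1)

BASE_RISK = 0.10        # risk of the outcome without treatment (invented)
TREATED_RISK = 0.06     # risk if the pill is actually taken (invented)
ADHERENCE = 0.80        # fraction of the treatment group who take their pills

def run_trial(n_per_arm=50_000):
    # control group: sugar pills, baseline risk throughout
    control_events = sum(random.random() < BASE_RISK for _ in range(n_per_arm))
    treated_events = 0
    for _ in range(n_per_arm):
        takes_pill = random.random() < ADHERENCE
        risk = TREATED_RISK if takes_pill else BASE_RISK
        treated_events += random.random() < risk
    # intention to treat: everyone allocated to the treatment arm is counted
    # there, so the estimate measures the effect of being *prescribed* the pill
    return control_events / n_per_arm, treated_events / n_per_arm

control_rate, treated_rate = run_trial()
print(round(control_rate, 3), round(treated_rate, 3))
```

Note how non-adherence dilutes the estimate: the treatment-arm rate lands near 0.8 × 0.06 + 0.2 × 0.10 = 0.068 rather than the 0.06 achievable by actually taking the pill, which is exactly the point made about intention to treat above.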
The main recent innovation in randomized experimentation concerns ‘A/B’ testing in web design, in which users are (unknowingly) directed to alternative layouts for web pages, and measurements made of time spent on pages, click-throughs to advertisements, and so on.
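A/B testing is the same randomized design in miniature. A minimal sketch with invented click-through rates: users are randomly assigned a layout without knowing it, and the observed rates are compared.

```python
import random

random.seed(7)

# Hypothetical true click-through rates for two page layouts
CLICK_RATE = {"A": 0.030, "B": 0.036}

clicks = {"A": 0, "B": 0}
visits = {"A": 0, "B": 0}
for _ in range(200_000):
    layout = random.choice("AB")       # the (unknowing) random allocation
    visits[layout] += 1
    clicks[layout] += random.random() < CLICK_RATE[layout]

for layout in "AB":
    print(layout, round(clicks[layout] / visits[layout], 4))
```

With this many visits the observed rates sit close to the true ones, so the randomized comparison correctly picks out layout B as the better performer.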
What Do We Do When We Can’t Randomize?
When the data does not arise from an experiment, it is said to be observational.
What Can We Do When We Observe an Association?
When an apparent association between two outcomes might be explained by some observed common factor that influences both, this common cause is known as a confounder.
The simplest technique for dealing with confounders is to look at the apparent relationship within each level of the confounder. This is known as adjustment, or stratification.
Simpson’s Paradox
Simpson’s paradox occurs when the apparent direction of an association is reversed by adjusting for a confounding factor, requiring a complete change in the apparent lesson from the data.
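A minimal numerical sketch of both ideas, with invented counts: a treatment looks worse overall, yet better within each level of the confounder (disease severity), because it was given mostly to severe cases. Stratifying, as described above, reverses the apparent lesson.

```python
# Invented counts: stratum -> group -> (successes, total)
data = {
    "mild":   {"treated": (18, 20), "untreated": (64, 80)},
    "severe": {"treated": (32, 80), "untreated": (4, 20)},
}

def rate(successes, total):
    return successes / total

# Overall (unadjusted) comparison: pool the counts across strata
treated = [sum(x) for x in zip(*(d["treated"] for d in data.values()))]
untreated = [sum(x) for x in zip(*(d["untreated"] for d in data.values()))]
print("overall:", rate(*treated), "vs", rate(*untreated))      # treated looks worse

# Adjusted: compare within each level of the confounder (stratification)
for stratum, d in data.items():
    print(stratum + ":", rate(*d["treated"]), "vs", rate(*d["untreated"]))
```

Here the overall success rate is 0.50 for treated versus 0.68 for untreated, yet within each stratum the treated group does better (0.90 vs 0.80 in mild cases, 0.40 vs 0.20 in severe cases): the confounder, severity, drove both the allocation and the outcome.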