Kindle Notes & Highlights
Read between February 19 and June 10, 2020
Since we cannot repeatedly draw a new sample from the population, we instead repeatedly draw ...
We therefore get an idea of how our estimate varies through this process of resampling with replacement. This is known as bootstrapping the data—the magical idea of pulling oneself up by one’s own bootstraps is reflected in this ability to learn about the variability in an estimate without having to make any assumptions about the shape of the population distribution.
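The resampling idea described above can be sketched in a few lines. This is a minimal illustration using made-up skewed numbers, not the book's data: each bootstrap resample is the same size as the original sample, drawn with replacement, and we record the mean of each one.

```python
import random

# Hypothetical skewed sample (illustrative only, not from the book's data).
sample = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20]

random.seed(42)  # fixed seed so the sketch is reproducible

def bootstrap_means(data, n_resamples=1000):
    """Draw resamples of the same size, with replacement,
    and record the mean of each resample."""
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in range(n)]
        means.append(sum(resample) / n)
    return means

means = bootstrap_means(sample)
# The spread of `means` shows how the estimate would vary
# from resample to resample, with no assumptions about the
# shape of the population distribution.
```

Note that no probability theory enters: the variability of the estimate is read directly off the collection of resampled means.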
sampling distributions of estimates, since they reflect the variability in estimates that arise
The first, and perhaps most notable, is that almost all trace of the skewness of the original samples has gone—the distributions of the estimates based on the resampled data are almost symmetric around the mean of the original data. This is a first glimpse of what is known as the Central Limit Theorem, which says that the distribution of sample means tends towards the form of a normal distribution with increasing sample size, almost regardless of the shape of the original data distribution.
Crucially, these bootstrap distributions allow us to quantify our uncertainty about the estimates shown in Table 7.1. For example, we can find the range of values that contains 95% of the means of the bootstrap resamples, and call this a 95% uncertainty interval for the original estimate; such intervals are also known as margins of error.
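Finding the range containing 95% of the bootstrap means amounts to taking the 2.5th and 97.5th percentiles of the resampled values. A minimal sketch, using a small toy list of bootstrap means rather than real resamples:

```python
def percentile_interval(values, level=0.95):
    """Return the central interval containing `level` of the values,
    i.e. the 2.5th and 97.5th percentiles for a 95% interval."""
    s = sorted(values)
    lo_idx = int(((1 - level) / 2) * len(s))  # count to trim from each tail
    hi_idx = len(s) - 1 - lo_idx
    return s[lo_idx], s[hi_idx]

# Toy bootstrap means (hypothetical); in practice, pass in the
# means computed from the bootstrap resamples themselves.
boot_means = [2.8, 3.0, 3.1, 3.3, 3.5, 3.6, 3.8, 4.0, 4.2, 4.5]
low, high = percentile_interval(boot_means)
# (low, high) is the 95% uncertainty interval for the estimate.
```

With a realistic number of resamples (say 1,000), the trimming drops the lowest and highest 25 values, leaving the central 95%.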
The second important feature of Figure 7.3 is that the bootstrap distributions get narrower as the sample size increases, which is reflected in the steadily narrower 95% uncertainty intervals.
This section has introduced some difficult but important ideas:
• the variability in statistics based on samples
• bootstrapping data when we do not want to make assumptions about the shape of the population
• the fact that the shape of the distribution of the statistics does not depend on the shape of the ori...
In Chapter 5 I fitted regression lines to Galton’s height data, enabling predictions to be made of, say, a daughter’s height based on her mother’s height, using a regression line with an estimated gradient of 0.33 (Table 5.2). But how confident can we be about the position of that fitted line? Bootstrapping provides an intuitive way of answering this question without making any mathematical assumptions about the underlying population.
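The same resampling idea applies to a fitted line: resample the (mother, daughter) pairs with replacement, refit the gradient each time, and look at the spread of the refitted gradients. A minimal sketch with hypothetical height pairs (illustrative only, not Galton's actual data):

```python
import random

# Hypothetical (mother, daughter) height pairs in cm,
# invented for illustration; not Galton's data.
pairs = [(160, 162), (165, 166), (158, 160), (170, 169),
         (175, 172), (162, 164), (168, 167), (172, 171)]

def slope(data):
    """Least-squares gradient of daughter's height on mother's height."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data)
    sxx = sum((x - mx) ** 2 for x, _ in data)
    return sxy / sxx

random.seed(1)  # fixed seed so the sketch is reproducible
boot_slopes = []
for _ in range(1000):
    # Resample whole pairs with replacement, then refit the line.
    resample = [random.choice(pairs) for _ in pairs]
    boot_slopes.append(slope(resample))
# The spread of boot_slopes indicates how confident we can be
# about the position of the fitted line.
```

Resampling whole pairs, rather than individual heights, preserves the relationship between mother and daughter within each observation.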
Bootstrapping provides an intuitive, computer-intensive way of assessing the uncertainty in our estimates, without making strong assumptions and without using probability theory. But the technique is not feasible when it comes to, say, working out the margins of error on unemployment surveys of 100,000 people. Although bootstrapping is a simple, brilliant and extraordinarily effective idea, it is just too clumsy to bootstrap such large quantities of data, especially when a convenient theory exists that can generate formulae for the width of uncertainty intervals.
Summary
• Uncertainty intervals are an important part of communicating statistics.
• Bootstrapping a sample consists of creating new data sets of the same size by resampling the original data, with replacement.
• Sample statistics calculated from bootstrap resamples tend towards a normal distribution for larger data sets, regardless of the shape of the original data distribution.
• Uncertainty intervals based on bootstrapping take advantage of modern computer power, do not require assumptions about the mathematical form of the population and do not require complex probability theory.
CHAPTER 8 Probability—the Language of Uncertainty and Variability