The Art of Statistics: How to Learn from Data
Read between February 19 - June 10, 2020
This is the distance between the 25th and 75th percentiles of the data and so contains the ‘central half’
the standard deviation is a widely used measure of spread. It is the most technically complex measure, but is only really appropriate for well-behaved symmetric data since it is also unduly influenced by outlying values.
benefits of using summary measures that are not unduly affected by odd observations such as 31,337—these are known as robust measures, and include the median and the inter-quartile range.
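A minimal sketch of why robust measures help, using only Python's standard library. The list of guesses is made up for illustration (it is not the book's data); only the odd value 31,337 comes from the highlight above. The helper name `summarize` and the choice of the "inclusive" quartile method are assumptions.

```python
import statistics

# Hypothetical guesses (made-up values), in the same spirit as the highlight.
guesses = [1200, 1500, 1750, 1800, 2000, 2100, 2300, 2500, 3000]
# The same guesses plus the single odd observation mentioned in the text.
with_outlier = guesses + [31337]


def summarize(values):
    """Return non-robust (mean, sd) and robust (median, iqr) summaries."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,  # distance between the 25th and 75th percentiles
    }


clean = summarize(guesses)
odd = summarize(with_outlier)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: without outlier {clean[name]:10.1f}   with outlier {odd[name]:10.1f}")
```

With these made-up numbers, one odd observation more than doubles the mean and inflates the standard deviation many times over, while the median only moves from 2000 to 2050 and the inter-quartile range from 550 to 687.5.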
Describing Differences between Groups of Numbers
Describing Relationships Between Variables
It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient,
line. An alternative measure is called Spearman’s rank correlation after English psychologist Charles Spearman (who developed the idea of an underlying general intelligence), and depends only on the ranks of the data rather than their specific values.
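The difference between the two coefficients can be sketched in a few lines of standard-library Python. The data below are invented for illustration, and the function names `pearson`, `ranks` and `spearman` are this note's own; Spearman's coefficient is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: how close the points lie to a straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation applied to ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented example: y always increases with x, but not along a straight line.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print("Pearson :", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Because the relationship is steadily increasing, the rank correlation is 1 (up to floating-point rounding), while the Pearson coefficient falls short of 1 because the points do not lie on a straight line.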
Correlation coefficients are simply summaries of association, and cannot be used to conclude that there is definitely an underlying relationship between volume and survival rates,
In many applications the x-axis represents a quantity known as the independent variable, and interest focuses on its influence on the dependent variable plotted on the y-axis.
Describing Trends
However, a logarithmic scale in Figure 2.7(b) separates out the continents, revealing the steeper gradient in Africa,
Communication
The first rule of communication is to shut up and listen, so that you can get to know about the audience for your communication,
The second rule of communication is to know what you want to achieve.
Hans Rosling, whose TED talks
Summary
• A variety of statistics can be used to summarize the empirical distribution of data-points, including measures of location and spread.
• Skewed data distributions are common, and some summary statistics are very sensitive to outlying values.
• Data summaries always hide some detail, and care is required so that important information is not lost.
• Single sets of numbers can be visualized in strip-charts, box-and-whisker plots and histograms.
• Consider transformations to better reveal patterns, and use the eye to detect patterns, outliers, similarities and clusters.
• Look at pairs of ...
CHAPTER 3 Why Are We Looking at Data Anyway? Populations and Measurement
The process of going from the raw responses in the survey to claims about the behaviour of the whole country can be broken down into a series of stages:
Learning from Data—the Process of ‘Inductive Inference’
Once we want to start generalizing from the data—learning something about the world outside our immediate observations—we need to ask ourselves the question, ‘Learn about what?’ And this requires us to confront the challenging idea of inductive inference.
In real life, deduction is the process of using the rules of cold logic to work from general premises to particular conclusions.
induction works the other way, in taking particular instances and trying to work out general conclusions.
crucial distinction is that deduction is logically certain, whereas induction is generally uncertain.
Figure 3.1 represents inductive inference as a generic diagram, showing the steps involved in going from data to the eventual target of our investigation:
Going from data (Stage 1) to the sample (Stage 2): these are problems of measurement: is what we record in our data an accurate reflection of what we are interested in? We want our data to be:
Reli...
in the sense of having low variability from occasion to occasion, and so being a prec...
V...
in the sense of measuring what you really want to measure, and not hav...
This can be tested to some extent by asking specific questions both at the start and end of the interview.
A survey would not be valid if the questions were biased in favour of a particular response.
We have seen how positive or negative framing of numbers can influence the impression given, and similarly the framing of a question can influence the response.
The responses to questions can also be influenced by what has been asked beforehand, a process known as priming.
Going from sample (Stage 2) to study population (Stage 3): this depends on the fundamental quality of the study, also known as its internal validity: does the sample we observe accurately reflect what is going on in the group w...
This is where we come to the crucial way of avoiding bias...
George Gallup, who essentially invented the idea of the opinion poll in the 1930s, came up with a fine analogy for the value of random sampling. He said that if you have cooked a large pan of soup, you do not need to eat it all to find out if it needs more seasoning. You can just taste a spoonful, provided you have given it a good stir.
The idea of adequate ‘stirring’ is crucial: if you want to be able to generalize from the sample to the population, you need to make sure your sample is representative. Just having masses of data does not necessarily help guarantee a good sample and can even give false reassurance.
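The soup analogy can be sketched as a small simulation. Everything here is an invented illustration, not an example from the book: the pot size, the 30% share of "salty" spoonfuls, and the two sample sizes are assumptions chosen only to make the contrast visible.

```python
import random

random.seed(42)  # fixed seed so the sketch gives the same result every run

POT_SIZE = 100_000
N_SALTY = 30_000  # assumed: 30% of the pot is "salty"
TRUE_SHARE = N_SALTY / POT_SIZE

# An unstirred pot: every salty spoonful (1) sits on top, the rest (0) below.
pot = [1] * N_SALTY + [0] * (POT_SIZE - N_SALTY)

# A very large sample taken only from the top of the unstirred pot.
big_unstirred = pot[:20_000]

# A much smaller sample drawn at random from the whole pot ("a good stir").
small_random = random.sample(pot, 500)

big_estimate = sum(big_unstirred) / len(big_unstirred)
small_estimate = sum(small_random) / len(small_random)

print(f"true share of salty spoonfuls : {TRUE_SHARE:.3f}")
print(f"20,000 spoonfuls, unstirred   : {big_estimate:.3f}")
print(f"500 spoonfuls, random         : {small_estimate:.3f}")
```

The 20,000 unstirred spoonfuls are all salty, so the large sample estimates a share of 1.0, while the 500 random spoonfuls land close to the true 0.3. Forty times more data does not compensate for a sample that is not representative.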
Going from study population (Stage 3) to target population (Stage 4):
We want our study to have external validity.
When We Have All the Data
Figure 3.1,
When we have all the data, it is straightforward to produce statistics that describe what has been measured. But when we want to use the data to draw broader conclusions about what is going on around us, then the quality of the data becomes paramount, and we need to be alert to the kind of systematic biases that can jeopardize the reliability of any claims.
allocation bias