Kindle Notes & Highlights
Read between February 19 - June 10, 2020
This is the distance between the 25th and 75th percentiles of the data and so contains the ‘central half’
the standard deviation is a widely used measure of spread. It is the most technically complex measure,
but is only really appropriate for well-behaved symmetric data* since it is also unduly influenced by outlying values.
benefits of using summary measures that are not unduly affected by odd observations such as 31,337—these are known as robust measures, and include the median and the inter-quartile range.
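The effect of a single odd observation on these summaries can be sketched in a few lines of Python. Apart from the 31,337 value mentioned in the highlight, the numbers below are made up purely for illustration, and the quartile convention used (`method="inclusive"`) is one assumption among several common ones.

```python
import statistics


def iqr(values):
    """Inter-quartile range: distance between the 25th and 75th percentiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def summarize(values):
    """Return location and spread summaries for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": iqr(values),
    }


# Hypothetical data: nine unremarkable values, then the same set plus one outlier.
clean = [1200, 1350, 1400, 1500, 1550, 1600, 1750, 1800, 1900]
with_outlier = clean + [31337]

before = summarize(clean)
after = summarize(with_outlier)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: {before[name]:10.1f} -> {after[name]:10.1f}")
```

In this sketch the mean and standard deviation move a long way when the outlier is added, while the median and inter-quartile range barely change, which is what "robust" means here.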
Describing Differences between Groups of Numbers
Describing Relationships Between Variables
It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient,
line. An alternative measure is called Spearman’s rank correlation after English psychologist Charles Spearman (who developed the idea of an underlying general intelligence), and depends only on the ranks of the data rather than their specific values.
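As a rough illustration of the difference between the two coefficients, the sketch below computes both by hand on made-up data that rises steadily but not along a straight line. The function names are hypothetical, and Spearman's coefficient is computed here as the Pearson correlation of the ranks (ties given their average rank), which is one standard way to define it.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: y always increases with x, but along a curve.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [math.exp(x) for x in xs]

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Because the ranks of the two variables agree exactly, the rank correlation is at its maximum, while the Pearson coefficient is lower because the points do not lie on a straight line.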
Correlation coefficients are simply summaries of association, and cannot be used to conclude that there is definitely an underlying relationship between volume and survival rates,
In many applications the x-axis represents a quantity known as the independent variable, and interest focuses on its influence on the dependent variable plotted on the y-axis.
Describing Trends
However, a logarithmic scale in Figure 2.7(b) separates out the continents, revealing the steeper gradient in Africa,
Communication
The first rule of communication is to shut up and listen, so that you can get to know about the audience for your communication,
The second rule of communication is to know what you want to achieve.
Hans Rosling, whose TED talks
Summary
• A variety of statistics can be used to summarize the empirical distribution of data-points, including measures of location and spread.
• Skewed data distributions are common, and some summary statistics are very sensitive to outlying values.
• Data summaries always hide some detail, and care is required so that important information is not lost.
• Single sets of numbers can be visualized in strip-charts, box-and-whisker plots and histograms.
• Consider transformations to better reveal patterns, and use the eye to detect patterns, outliers, similarities and clusters.
• Look at pairs of ...
CHAPTER 3 Why Are We Looking at Data Anyway? Populations and Measurement
The process of going from the raw responses in the survey to claims about the behaviour of the whole country can be broken down into a series of stages:
Learning from Data—the Process of ‘Inductive Inference’
Once we want to start generalizing from the data—learning something about the world outside our immediate observations—we need to ask ourselves the question, ‘Learn about what?’ And this requires us to confront the challenging idea of inductive inference.
In real life, deduction is the process of using the rules of cold logic to work from general premises to particular conclusions.
induction works the other way, in taking particular instances and trying to work out general conclusions.
crucial distinction is that deduction is logically certain, whereas induction is generally uncertain.
Figure 3.1 represents inductive inference as a generic diagram, showing the steps involved in going from data to the eventual target of our investigation:
Going from data (Stage 1) to the sample (Stage 2): these are problems of measurement: is what we record in our data an accurate reflection of what we are interested in? We want our data to be:
Reli...
This highlight has been truncated due to consecutive passage length restrictions.
in the sense of having low variability from occasion to occasion, and so being a prec...
V...
in the sense of measuring what you really want to measure, and not hav...
This can be tested to some extent by asking specific questions both at the start and end of the interview.
A survey would not be valid if the questions were biased in favour of a particular response.
We have seen how positive or negative framing of numbers can influence the impression given, and similarly the framing of a question can influence the response.
The responses to questions can also be influenced by what has been asked beforehand, a process known as priming.
Going from sample (Stage 2) to study population (Stage 3): this depends on the fundamental quality of the study, also known as its internal validity: does the sample we observe accurately reflect what is going on in the group w...
This is where we come to the crucial way of avoiding bias...
George Gallup, who essentially invented the idea of the opinion poll in the 1930s, came up with a fine analogy for the value of random sampling. He said that if you have cooked a large pan of soup, you do not need to eat it all to find out if it needs more seasoning. You can just taste a spoonful, provided you have given it a good stir.
The idea of adequate ‘stirring’ is crucial: if you want to be able to generalize from the sample to the population, you need to make sure your sample is representative. Just having masses of data does not necessarily help guarantee a good sample and can even give false reassurance.
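The point about 'stirring' can be illustrated with a small simulation. Everything in it is an assumption made up for the sketch: a population of 100,000 people, half of whom are easy to reach and tend to answer 'yes' more often, a random sample of 1,000, and a much larger sample of 20,000 drawn only from the easy-to-reach half.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: each person is (easy_to_reach, answers_yes).
# Easy-to-reach people say yes with probability 0.7, the rest with 0.3.
population = []
for i in range(100_000):
    easy_to_reach = i < 50_000
    p_yes = 0.7 if easy_to_reach else 0.3
    population.append((easy_to_reach, rng.random() < p_yes))


def proportion_yes(people):
    """Share of people in the list who answer yes."""
    return sum(answer for _reach, answer in people) / len(people)


truth = proportion_yes(population)

# A well-stirred spoonful: a modest simple random sample of everyone.
random_sample = rng.sample(population, 1_000)

# A huge unstirred ladleful: many more people, but only the easy-to-reach ones.
reachable = [person for person in population if person[0]]
biased_sample = rng.sample(reachable, 20_000)

random_estimate = proportion_yes(random_sample)
biased_estimate = proportion_yes(biased_sample)

print(f"true proportion:          {truth:.3f}")
print(f"random sample (n=1,000):  {random_estimate:.3f}")
print(f"biased sample (n=20,000): {biased_estimate:.3f}")
```

Under these assumptions the small random sample lands close to the true proportion, while the sample twenty times its size stays far off, because collecting more data from the same unrepresentative group does not remove the bias.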
Going from study population (Stage 3) to target population (Stage 4):
We want our study to have external validity.
When We Have All the Data
Figure 3.1,
When we have all the data, it is straightforward to produce statistics that describe what has been measured. But when we want to use the data to draw broader conclusions about what is going on around us, then the quality of the data becomes paramount, and we need to be alert to the kind of systematic biases that can jeopardize the reliability of any claims.
allocation bias