The Art of Statistics: How to Learn from Data
Read between February 19 and June 10, 2020
reverse causation.
Potential common causes that we do not measure are known as lurking factors,
Can We Ever Conclude Causation from Observational Data?
With Richard Doll, Austin Bradford Hill led the research in the 1950s that eventually confirmed the link between smoking and lung cancer. In 1965 Bradford Hill set out a list of criteria that needed to be considered before concluding that an observed link between an exposure and an outcome was causal, where an exposure might comprise anything from chemicals in the environment to habits such as smoking or lack of exercise. These criteria have been subsequently much debated, and the version shown below was developed by Jeremy Howick and colleagues, separated into what they call direct, mechanistic and parallel evidence.12
Summary
• Causation, in the statistical sense, means that when we intervene, the chances of different outcomes are systematically changed.
• Causation is difficult to establish statistically, but well-designed randomized trials are the best available framework.
• Principles of blinding, intention-to-treat and so on have enabled large-scale clinical trials to identify moderate but important effects.
• Observational data may have background factors influencing the apparent observed relationships between an exposure and an outcome, which may be either observed confounders or lurking factors.
• …
CHAPTER 5 Modelling Relationships Using Regression
statistical model, which is a formal representation of the relationships between variables, which we can use for the desired explanation or prediction.
residual (the vertical dashed lines on the plot), which is the size of the error were we to use the line to predict a son’s height from his father’s.
least-squares fitted line, for which the sum of the squares of the residuals is smallest.*
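The least-squares idea can be sketched in a few lines of Python. The father/son heights below are invented for illustration, not Galton's actual data:

```python
# Minimal least-squares sketch on made-up father/son heights (inches).
fathers = [65, 67, 68, 70, 72, 74]
sons    = [67, 68, 68, 71, 71, 73]

n = len(fathers)
mean_x = sum(fathers) / n
mean_y = sum(sons) / n

# The least-squares gradient and intercept minimise the sum of
# squared residuals over all possible straight lines.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fathers, sons))
sxx = sum((x - mean_x) ** 2 for x in fathers)
gradient = sxy / sxx
intercept = mean_y - gradient * mean_x

# Residual: observed son's height minus what the line predicts.
residuals = [y - (intercept + gradient * x) for x, y in zip(fathers, sons)]
print(round(gradient, 2))
```

A handy by-product of the least-squares fit is that the residuals always sum to zero: the line passes through the point of means.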
Galton called this ‘regression to mediocrity’, whereas now it is known as regression to the mean.
In basic regression analysis the dependent variable is the quantity that we want to predict or explain, usually forming the vertical y-axis of a graph—this is sometimes known as the response variable.
The gradient is also known as the regression coefficient.
Table 5.2 shows the correlations between parent and offspring heights, and the gradients of regression lines.* There is a simple relationship between the gradients, the Pearson correlation coefficient and the standard deviations of the variables.*
The meaning of these gradients depends completely on our assumptions about the relationship between the variables being studied. For correlational data, the gradient indicates how much we would expect the dependent variable to change, on average, if we observe a one unit difference for the independent variable. For example, if Alice is one inch taller than Betty, we would predict Alice’s adult daughter to be 0.33 inches taller than Betty’s adult daughter.
Regression Lines Are Models
Statistical models have two main components.
First, a mathematical formula that expresses a deterministic, predictable component,
But the deterministic part of a model is not going to be a perfect representation of the observed world.
the difference between what the model predicts, and what actually happens, is the second component of a model and is known as the residual error—although it is important to remember that in statistical modelling, ‘error’ does not refer to a mistake, but the inevitable inability of a model to exactly represent what we observe.
observation = deterministic model + residual error.
This is the classic idea of the signal and the noise.
This section contains a simple lesson: just because we act, and something changes, it doesn’t mean we were responsible for the result.
Humans seem to find this simple truth difficult to grasp—we are always keen to construct an explanatory narrative, and even keener if we are at its centre.
We have a strong psychological tendency to attribute change to intervention, and this makes before-and-after comparisons treacherous.
But if we believe these runs of good or bad fortune represent a constant state of affairs, then we will wrongly attribute the reversion to normal as the consequence of any intervention
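A small simulation (with arbitrary scales and numbers, not from the book) makes the point: select the best performers in round one, and their round-two average falls back towards the overall mean with no intervention at all.

```python
import random

random.seed(42)

# Each person's score = fixed skill + fresh luck each round.
skills = [random.gauss(100, 5) for _ in range(10_000)]
round1 = [s + random.gauss(0, 10) for s in skills]
round2 = [s + random.gauss(0, 10) for s in skills]

# Pick the top 1% in round one, then see how they do in round two.
top = sorted(range(len(round1)), key=lambda i: round1[i], reverse=True)[:100]
avg1 = sum(round1[i] for i in top) / 100
avg2 = sum(round2[i] for i in top) / 100

# Round-two average is much closer to the overall mean of 100:
# the group's round-one extremeness was partly luck, and the luck
# does not repeat.
print(round(avg1, 1), round(avg2, 1))
```

Any "intervention" applied to the top group between the rounds would look effective in a naive before-and-after comparison, even though it did nothing.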
Dealing With More Than One Explanatory Variable
As an example of having more than one explanatory variable, we can look at how the height of a son or daughter is related to the height of their father and their mother.
This is known as a multiple linear regression.*
When we just had one explanatory variable the relationship with the response variable was summarized by a gradient, which can also be interpreted as a coefficient in a regression equation; this idea can be generalized to more than one explanatory variable.
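The generalization can be sketched by hand for two explanatory variables: predicting a son's height from both his father's and his mother's heights. All figures below are invented, and the normal equations are solved directly rather than with a statistics library:

```python
# Multiple linear regression by hand, on invented heights (inches).
fathers = [65, 67, 68, 70, 72, 74]
mothers = [61, 64, 62, 66, 65, 68]
sons    = [66, 68, 67, 71, 71, 74]

n = len(sons)
mf, mm, ms = (sum(v) / n for v in (fathers, mothers, sons))
f = [x - mf for x in fathers]     # centred predictors
m = [x - mm for x in mothers]
s = [y - ms for y in sons]

dot = lambda a, b: sum(p * q for p, q in zip(a, b))

# Normal equations for two centred predictors, solved by Cramer's rule.
det = dot(f, f) * dot(m, m) - dot(f, m) ** 2
b_father = (dot(f, s) * dot(m, m) - dot(m, s) * dot(f, m)) / det
b_mother = (dot(m, s) * dot(f, f) - dot(f, s) * dot(f, m)) / det
intercept = ms - b_father * mf - b_mother * mm

# b_father: expected change in son's height per extra inch of the
# father's height, holding the mother's height fixed.
print(round(b_father, 2), round(b_mother, 2))
```

The key change of interpretation is "holding the other variables fixed": each coefficient is an adjusted, not a raw, association.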
Different Types of Response Variables
absurd. So a form of regression has been developed for proportions, called logistic regression, which ensures a curve which cannot go above 100% or below 0%.
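The bounded curve comes from the logistic function. The coefficients below are arbitrary illustrative values, not fitted to any data; the sketch just shows that the predicted proportion can never escape the 0%–100% range:

```python
import math

def logistic(x, a=-10.0, b=0.2):
    """Predicted proportion from a logistic-regression-style curve.

    The coefficients a and b are arbitrary here; in a real analysis
    they would be estimated from data.
    """
    return 1 / (1 + math.exp(-(a + b * x)))

# Whatever x is, the prediction stays strictly between 0 and 1.
probs = [logistic(x) for x in range(-100, 200, 10)]
assert all(0 < p < 1 for p in probs)
```

A straight line fitted to proportions would eventually cross those bounds; the S-shaped logistic curve flattens out instead.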
Beyond Basic Regression Modelling
An extraordinary increase in computing power has allowed far more sophisticated models to be developed. Very broadly, four main modelling strategies have been adopted by different communities of researchers:
Rather simple mathematical representations for associations, such as the linear regression analyses in this chapter, which tend to be favoured by statisticians.
Complex deterministic models based on scientific understanding of a physical process, such as those used in weather forecasting, which are intended to realistically represent underlying mechanisms, and w...
Complex algorithms used to make a decision or prediction that have been derived from an analysis of huge numbers of past examples, for example to recommend books you might like to buy from an online retailer, and which co...
Regression models that claim to reach causal conclusions, as favoured by economists.
Summary
• Regression models provide a mathematical representation between a set of explanatory variables and a response variable.
• The coefficients in a regression model indicate how much we expect the response to change when the explanatory variable is observed to change.
• Regression-to-the-mean occurs when more extreme responses revert to nearer the long-term average, since a contribution to their previous extremeness was pure chance.
• Regression models can incorporate different types of response variable, explanatory variables and non-linear relationships.
• Caution is required in …
CHAPTER 6 Algorithms, Analytics and Prediction
The theme behind this chapter is that such practical problems can be tackled by using past data to produce an algorithm, a mechanistic formula that will automatically produce an answer for each new case that comes along with either no, or minimal, additional human intervention: essentially, this is ‘technology’ rather than science.
two broad tasks for such an algorithm:
Classification (also known as discrimination or supervised learning): to say what kind of situation we’re facing. For example, the likes and dislikes of an online customer, or whether t...
Prediction: to tell us what is going to happen. For example, what the weather will be next week, what a stock price might do tomorrow, what products that customer might buy, or whether that child ...
they both have the same underlying nature: to take a set of observations relevant to a current situation, and map them to a relevant conclusion.
This process has been termed predictive analytics, but we are verging into the territory of artificial intelligence (AI),
Finding Patterns

One strategy for dealing with an excessive number of cases is to identify groups that are similar, a process known as clustering or unsupervised learning,
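A toy version of this idea is one-dimensional k-means with two clusters, here run on invented data drawn from two well-separated groups; real clustering libraries handle many dimensions and choose k more carefully:

```python
import random

random.seed(1)

# Invented data: two groups, centred near 0 and near 10.
data = [random.gauss(0, 1) for _ in range(50)] + \
       [random.gauss(10, 1) for _ in range(50)]

centres = [min(data), max(data)]          # crude starting guesses
for _ in range(10):                       # alternate assign / update steps
    groups = [[], []]
    for x in data:
        nearest = min((abs(x - c), i) for i, c in enumerate(centres))[1]
        groups[nearest].append(x)
    centres = [sum(g) / len(g) for g in groups]

# The fitted centres end up near the two true group centres.
print([round(c, 1) for c in centres])
```

No labels were supplied anywhere: the algorithm discovers the grouping from similarity alone, which is what makes it "unsupervised".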
Before getting on with constructing an algorithm for classification or prediction, we may also have to reduce the raw data on each case to a manageable dimension when p is excessively large, that is, when too many features are being measured on each case. This process is known as feature engineering.
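A minimal sketch of feature engineering, on invented data: each case arrives as many raw readings (say, 24 hourly activity counts), which we reduce to a few summary features before any classification step:

```python
import statistics as st

# One case's raw readings: 24 hourly activity counts (invented).
raw_case = [12, 9, 4, 3, 2, 2, 5, 30, 55, 60, 58, 50,
            48, 52, 47, 45, 40, 44, 60, 35, 28, 20, 15, 10]

# Reduce 24 raw numbers to 3 engineered features.
features = {
    "mean":   st.mean(raw_case),
    "peak":   max(raw_case),
    "spread": st.stdev(raw_case),
}
print({k: round(v, 1) for k, v in features.items()})
```

Which summaries to keep is a judgment call: good engineered features preserve the signal relevant to the task while discarding the rest.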
Classification and Prediction
should age be considered as a categorical variable, banded into the categories shown in Figure 6–2, or a continuous variable?
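The choice can be made concrete with a small sketch: the same age can be fed to a model either as a number or banded into a category. The cut-points below are invented for illustration, not those of Figure 6-2:

```python
# Banding a continuous variable into categories (cut-points invented).
def age_band(age):
    if age < 18:
        return "under 18"
    if age < 40:
        return "18-39"
    if age < 65:
        return "40-64"
    return "65+"

assert age_band(25) == "18-39"
assert age_band(70) == "65+"
```

Banding makes the model simpler to display but throws away information: a 40-year-old and a 64-year-old become indistinguishable.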
shall instead proceed straight to making predictions.