Kindle Notes & Highlights
Read between February 19 and June 10, 2020
reverse causation.
Potential common causes that we do not measure are known as lurking factors.
Can We Ever Conclude Causation from Observational Data?
In the 1950s, Richard Doll led the research that eventually confirmed the link between smoking and lung cancer. In 1965 he set out a list of criteria that needed to be considered before concluding that an observed link between an exposure and an outcome was causal, where an exposure might comprise anything from chemicals in the environment to habits such as smoking or lack of exercise. These criteria have subsequently been much debated, and the version shown below was developed by Jeremy Howick and colleagues, separated into what they call direct, mechanistic and parallel evidence.12
Summary
• Causation, in the statistical sense, means that when we intervene, the chances of different outcomes are systematically changed.
• Causation is difficult to establish statistically, but well-designed randomized trials are the best available framework.
• Principles of blinding, intention-to-treat and so on have enabled large-scale clinical trials to identify moderate but important effects.
• Observational data may have background factors influencing the apparent observed relationships between an exposure and an outcome, which may be either observed confounders or lurking factors.
CHAPTER 5 Modelling Relationships Using Regression
statistical model, which is a formal representation of the relationships between variables that we can use for the desired explanation or prediction.
residual (the vertical dashed lines on the plot), which is the size of the error were we to use the line to predict a son’s height from his father’s.
least-squares fitted line, for which the sum of the squares of the residuals is smallest.*
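A minimal sketch of how such a line can be computed, using NumPy and invented father/son heights (not Galton's actual data): the least-squares gradient is the covariance of the two variables divided by the variance of the predictor, and the line passes through the point of means.

```python
import numpy as np

# Illustrative father/son heights in inches (made-up values, not Galton's data)
father = np.array([63.0, 65.0, 67.0, 69.0, 71.0, 73.0])
son    = np.array([65.1, 66.4, 67.2, 68.8, 69.3, 70.6])

# Closed-form least-squares estimates: gradient = cov(x, y) / var(x),
# and the fitted line passes through the point of means (x-bar, y-bar).
gradient = np.cov(father, son, ddof=1)[0, 1] / np.var(father, ddof=1)
intercept = son.mean() - gradient * father.mean()

predicted = intercept + gradient * father
residuals = son - predicted                      # the vertical dashed lines
print(f"gradient={gradient:.3f}, intercept={intercept:.2f}")
print(f"sum of squared residuals: {np.sum(residuals**2):.3f}")
```

Any other straight line drawn through these points would give a larger sum of squared residuals, which is what makes this the least-squares fit.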
Galton called this ‘regression to mediocrity’, whereas now it is known as regression to the mean.
In basic regression analysis the dependent variable is the quantity that we want to predict or explain, usually forming the vertical y-axis of a graph—this is sometimes known as the response variable.
The gradient is also known as the regression coefficient.
Table 5.2 shows the correlations between parent and offspring heights, and the gradients of regression lines.* There is a simple relationship between the gradients, the Pearson correlation coefficient and the standard deviations of the variables.*
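The simple relationship referred to is that the gradient equals the Pearson correlation multiplied by the ratio of the standard deviations, b = r × s_y / s_x. A quick numerical check (variable names and data are mine, for illustration only):

```python
import numpy as np

x = np.array([63.0, 65.0, 67.0, 69.0, 71.0, 73.0])   # e.g. parent heights
y = np.array([65.1, 66.4, 67.2, 68.8, 69.3, 70.6])   # e.g. offspring heights

r = np.corrcoef(x, y)[0, 1]                    # Pearson correlation coefficient
gradient = r * y.std(ddof=1) / x.std(ddof=1)   # b = r * s_y / s_x

# This agrees with the direct least-squares gradient cov(x, y) / var(x)
direct = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
assert np.isclose(gradient, direct)
print(f"r={r:.3f}, gradient={gradient:.3f}")
```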
The meaning of these gradients depends completely on our assumptions about the relationship between the variables being studied. For correlational data, the gradient indicates how much we would expect the dependent variable to change, on average, if we observe a one unit difference for the independent variable. For example, if Alice is one inch taller than Betty, we would predict Alice’s adult daughter to be 0.33 inches taller than Betty’s adult daughter.
Regression Lines Are Models
Statistical models have two main components.
First, a mathematical formula that expresses a deterministic, predictable component,
But the deterministic part of a model is not going to be a perfect representation of the observed world.
the difference between what the model predicts, and what actually happens, is the second component of a model and is known as the residual error—although it is important to remember that in statistical modelling, ‘error’ does not refer to a mistake, but the inevitable inability of a model to exactly represent what we observe.
observation = deterministic model + residual error.
This is the classic idea of the signal and the noise.
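For a straight-line model this decomposition can be written (notation mine) as

$$ y_i = \alpha + \beta x_i + \varepsilon_i $$

where $\alpha + \beta x_i$ is the deterministic component (the fitted line, the signal) and $\varepsilon_i$ is the residual error for observation $i$ (the noise).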
This section contains a simple lesson: just because we act, and something changes, it does not mean we were responsible for the result.
Humans seem to find this simple truth difficult to grasp—we are always keen to construct an explanatory narrative, and even keener if we are at its centre.
We have a strong psychological tendency to attribute change to intervention, and this makes before-and-after comparisons treacherous.
But if we believe these runs of good or bad fortune represent a constant state of affairs, then we will wrongly attribute the reversion to normal as the consequence of any intervention
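A small simulation can make this concrete: if each period's result is a stable underlying level plus pure chance, the worst performers in one period tend to improve in the next even with no intervention at all. This is a sketch with invented numbers, not data from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

true_level = 50.0                                       # constant state of affairs
before = true_level + rng.normal(0, 10, size=10_000)    # chance variation
after  = true_level + rng.normal(0, 10, size=10_000)    # fresh, independent chance

worst = before < np.percentile(before, 10)              # the extreme 10% "before"
print(f"mean before (worst group): {before[worst].mean():.1f}")   # well below 50
print(f"mean after  (worst group): {after[worst].mean():.1f}")    # back near 50
# The 'after' mean reverts toward 50 with no intervention whatsoever:
# regression to the mean, not the effect of anything we did.
```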
Dealing With More Than One Explanatory Variable
As an example of having more than one explanatory variable, we can look at how the height of a son or daughter is related to the height of their father and their mother.
This is known as a multiple linear regression.*
When we just had one explanatory variable the relationship with the response variable was summarized by a gradient, which can also be interpreted as a coefficient in a regression equation; this idea can be generalized to more than one explanatory variable.
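A sketch of such a multiple regression, fitting a child's height on both parents' heights with NumPy's least-squares solver (the heights here are invented for illustration):

```python
import numpy as np

# Invented heights in inches
father = np.array([68.0, 70.0, 65.0, 72.0, 67.0, 71.0])
mother = np.array([63.0, 65.0, 61.0, 66.0, 64.0, 62.0])
child  = np.array([66.5, 68.9, 63.8, 70.1, 66.0, 67.4])

# Design matrix with an intercept column; solve for all coefficients at once
X = np.column_stack([np.ones_like(father), father, mother])
coef, *_ = np.linalg.lstsq(X, child, rcond=None)
intercept, b_father, b_mother = coef
print(f"intercept={intercept:.1f}, father coeff={b_father:.2f}, mother coeff={b_mother:.2f}")
```

Each coefficient is interpreted as before, except that it now describes the expected difference in the response for a one-unit difference in that explanatory variable, with the other explanatory variables held fixed.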
Different Types of Response Variables
absurd. So a form of regression has been developed for proportions, called logistic regression, which ensures a curve which cannot go above 100% or below 0%.
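The curve in question is the logistic function, which squeezes any linear combination of explanatory variables into the range 0 to 1. A minimal sketch with illustrative coefficients of my own choosing:

```python
import numpy as np

def logistic(z):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative linear predictor: intercept + gradient * x
a, b = -4.0, 0.08
x = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
p = logistic(a + b * x)
print(np.round(p, 3))   # every predicted proportion stays between 0 and 1
```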
Beyond Basic Regression Modelling
The extraordinary increase in computing power has allowed far more sophisticated models to be developed. Very broadly, four main modelling strategies have been adopted by different communities of researchers:
• Rather simple mathematical representations for associations, such as the linear regression analyses in this chapter, which tend to be favoured by statisticians.
• Complex deterministic models based on scientific understanding of a physical process, such as those used in weather forecasting, which are intended to realistically represent underlying mechanisms, and w...
• Complex algorithms used to make a decision or prediction that have been derived from an analysis of huge numbers of past examples, for example to recommend books you might like to buy from an online retailer, and which co...
• Regression models that claim to reach causal conclusions, as favoured by economists.
Summary
• Regression models provide a mathematical representation between a set of explanatory variables and a response variable.
• The coefficients in a regression model indicate how much we expect the response to change when the explanatory variable is observed to change.
• Regression-to-the-mean occurs when more extreme responses revert to nearer the long-term average, since a contribution to their previous extremeness was pure chance.
• Regression models can incorporate different types of response variable, explanatory variables and non-linear relationships.
• Caution is required in ...
CHAPTER 6 Algorithms, Analytics and Prediction
The theme behind this chapter is that such practical problems can be tackled by using past data to produce an algorithm, a mechanistic formula that will automatically produce an answer for each new case that comes along, with either no, or minimal, additional human intervention: essentially, this is ‘technology’ rather than science.
two broad tasks for such an algorithm:
• Classification (also known as discrimination or supervised learning): to say what kind of situation we’re facing. For example, the likes and dislikes of an online customer, or whether t...
• Prediction: to tell us what is going to happen. For example, what the weather will be next week, what a stock price might do tomorrow, what products that customer might buy, or whether that child ...
they both have the same underlying nature: to take a set of observations relevant to a current situation, and map them to a relevant conclusion.
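One of the simplest algorithms of this kind is a nearest-neighbour rule: map a new case to the conclusion of the most similar past example. This is a toy sketch of the general idea, with all data invented:

```python
import numpy as np

# Past cases: each row is a set of observations with a known conclusion
past_features = np.array([[1.0, 2.0], [2.0, 1.5], [8.0, 9.0], [9.0, 8.5]])
past_labels   = np.array(["dislike", "dislike", "like", "like"])

def classify(new_case):
    """Map a new case to the conclusion of its nearest past example."""
    distances = np.linalg.norm(past_features - new_case, axis=1)
    return past_labels[np.argmin(distances)]

print(classify(np.array([8.5, 9.2])))   # -> "like"
```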
This process has been termed predictive analytics, but we are verging into the territory of artificial intelligence (AI),
Finding Patterns
One strategy for dealing with an excessive number of cases is to identify groups that are similar, a process known as clustering or unsupervised learning.
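A sketch of clustering with scikit-learn's KMeans, assuming scikit-learn is available; the data are invented. No labels are supplied, which is what makes the learning 'unsupervised':

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented data: two measured features per case
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.2], [5.3, 4.8], [4.9, 5.1]])

# Ask for two clusters; the algorithm groups similar cases together
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # e.g. [0 0 0 1 1 1]: two groups of similar cases
```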
Before getting on with constructing an algorithm for classification or prediction, we may also have to reduce the raw data on each case to a manageable dimension when p is excessively large, that is, when too many features are being measured on each case. This process is known as feature engineering.
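One standard technique for this kind of dimension reduction is principal component analysis (my choice of example, not necessarily the book's); a sketch with scikit-learn and synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))      # 100 cases, p = 50 features each

# Project each case onto the 5 directions that capture the most variation
X_reduced = PCA(n_components=5).fit_transform(X)
print(X_reduced.shape)              # (100, 5): a manageable dimension
```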
Classification and Prediction
should age be considered as a categorical variable, banded into the categories shown in Figure 6.2, or a continuous variable?
shall instead proceed straight to making predictions.