The Art of Statistics: Learning from Data
Kindle Notes & Highlights
Read between May 7 - May 10, 2020
29%
The least-squares prediction line in Figure 5.1 goes through the middle of the cloud of points, representing the mean values for the heights of fathers and sons, but does not follow the diagonal line of ‘equality’. It is clearly lower than the line of equality for fathers who are taller than average, and higher than the line of equality for fathers who are shorter than average. This means that tall fathers tend to have sons who are slightly shorter than them, while shorter fathers have slightly taller sons. Galton called this ‘regression to mediocrity’, whereas now it is known as regression to the mean.
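To see the effect numerically, here is a minimal sketch using simulated heights (all figures invented, not Galton's actual data). Because sons keep only part of their father's deviation from the mean, the fitted least-squares gradient comes out below 1, flatter than the line of equality:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical father heights in inches (spread roughly realistic).
father = rng.normal(68, 2.7, n)
# Sons keep only half of the father's deviation from the mean, plus noise:
# this partial inheritance is what produces regression to the mean.
son = 68 + 0.5 * (father - 68) + rng.normal(0, 2.3, n)

slope, intercept = np.polyfit(father, son, 1)  # least-squares straight line
print(slope)  # around 0.5, well below the gradient of 1 of the line of equality
```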
29%
In basic regression analysis the dependent variable is the quantity that we want to predict or explain, usually forming the vertical y-axis of a graph – this is sometimes known as the response variable.
29%
The gradient is also known as the regression coefficient.
29%
The meaning of these gradients depends completely on our assumptions about the relationship between the variables being studied. For correlational data, the gradient indicates how much we would expect the dependent variable to change, on average, if we observe a one unit difference for the independent variable.
30%
If, however, we assumed a causal relationship then the gradient has a very different interpretation – it is the change we would expect in the dependent variable were we to intervene and change the independent variable to a value one unit higher. This is definitely not the case for heights since they cannot be altered by experimental means, at least for adults.
30%
The regression line we fitted between fathers’ and sons’ heights is a very basic example of a statistical model.
30%
Statistical models have two main components. First, a mathematical formula that expresses a deterministic, predictable component, for example the fitted straight line that enables us to make a prediction of a son’s height from his father’s. But the deterministic part of a model is not going to be a perfect representation of the observed world.
30%
As we saw in Figure 5.1, there is a big scatter of heights around the regression line, and the difference between what the model predicts, and what actually happens, is the second component of a model and is known as the residual error – although it is important to remember that in statistical modelling, ‘error’ does not refer to a mistake, but the inevitable inability of a model to exactly represent what we observe.
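Continuing the same kind of sketch (again with invented data), the two components separate cleanly: the fitted line is the deterministic part, and the residuals are simply observed minus fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
father = rng.normal(68, 2.7, 500)                  # hypothetical heights
son = 34 + 0.5 * father + rng.normal(0, 2.3, 500)
slope, intercept = np.polyfit(father, son, 1)

fitted = intercept + slope * father  # deterministic, predictable component
residual = son - fitted              # residual 'error': scatter, not a mistake
print(round(residual.mean(), 6))     # essentially zero for a least-squares fit
print(round(residual.std(), 2))      # the size of the scatter round the line
```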
30%
This section contains a simple lesson: just because we act, and something changes, it doesn’t mean we were responsible for the result. Humans seem to find this simple truth difficult to grasp – we are always keen to construct an explanatory narrative, and even keener if we are at its centre.
30%
We have a strong psychological tendency to attribute change to intervention, and this makes before-and-after comparisons treacherous.
30%
But if we believe these runs of good or bad fortune represent a constant state of affairs, then we will wrongly attribute the reversion to normal as the consequence of any intervention we have made.
30%
But much of the improvement seen in people who do not receive any active treatment may be regression-to-the-mean, since patients are enrolled in trials when they are showing symptoms, and many of these would have resolved anyway.
31%
Since Galton’s early work there have been many extensions to the basic idea of regression, vastly helped by modern computing. These developments include:
- having many explanatory variables
- explanatory variables that are categories rather than numbers
- relationships that are not straight lines and adapt flexibly to the pattern of the data
- response variables that are not continuous, such as proportions and counts
31%
As an example of having more than one explanatory variable, we can look at how the height of a son or daughter is related to the height of their father and their mother.
31%
This is known as a multiple linear regression.
31%
So the height of a parent has a slightly reduced association with their adult offspring’s height, when allowing for the effect of the other parent. This could be due to the fact that taller women tend to marry taller men, so that each parent’s height is not a completely independent factor.
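A sketch of that multiple linear regression with simulated data (the coefficients and the parents' correlation are invented for illustration). Note how the father's coefficient shrinks once the correlated mother's height enters the model:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
father = rng.normal(68, 2.7, n)
# Assortative mating: taller men tend to marry taller women (invented strength).
mother = 64 + 0.3 * (father - 68) + rng.normal(0, 2.3, n)
child = 20 + 0.4 * father + 0.3 * mother + rng.normal(0, 2.0, n)

# Multiple regression: an intercept column plus both parents' heights.
X = np.column_stack([np.ones(n), father, mother])
coef, *_ = np.linalg.lstsq(X, child, rcond=None)

simple_slope = np.polyfit(father, child, 1)[0]  # father's height alone
print(simple_slope, coef[1])  # the coefficient is smaller after adjustment
```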
31%
A regression analysis had the rate of tumours as the dependent, or response, variable, and education as the independent, or explanatory, variable of interest. Other factors entered into the regression included age at diagnosis, calendar year, region of Sweden, marital status and income, all of which were considered to be potential confounding variables. This adjustment for confounders is an attempt to tease out a purer relationship between education and brain tumours, but it can never be wholly adequate. There will always remain the suspicion that some other lurking process might be at work …
31%
Not all data are continuous measurements such as height.
31%
Each type of dependent variable has its own form of multiple regression, with a correspondingly different interpretation of the estimated coefficients.
31%
While we could have fitted a linear regression line through these points, naïve extrapolation would suggest that if a hospital treated a huge number of cases, their survival would be predicted to be greater than 100%, which is absurd. So a form of regression has been developed for proportions, called logistic regression, which ensures a curve which cannot go above 100% or below 0%.
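A minimal sketch of the idea (intercept and gradient invented): the logistic function squashes any linear predictor onto the probability scale, so no extrapolation can ever produce a proportion above 100% or below 0%.

```python
import numpy as np

def logistic(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -2.0, 0.0003  # hypothetical coefficients on the log-odds scale
cases_treated = np.array([100, 1000, 5000, 20000])
predicted_survival = logistic(beta0 + beta1 * cases_treated)
print(predicted_survival)  # strictly between 0 and 1, however large the input
```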
32%
Very broadly, four main modelling strategies have been adopted by different communities of researchers:
32%
Rather simple mathematical representations for associations, such as the linear regression analyses in this chapter, which tend to be favoured by statisticians.
32%
Complex deterministic models based on scientific understanding of a physical process, such as those used in weather forecasting, which are intended to realistically represent underlying mechanisms, and w...
This highlight has been truncated due to consecutive passage length restrictions.
32%
Complex algorithms used to make a decision or prediction that have been derived from an analysis of huge numbers of past examples, for example to recommend books you might like to buy from an online retailer, and which come from the world of computer science and machine learning. These will often be ‘black boxes’ in the sense that they may make go...
This highlight has been truncated due to consecutive passage length restrictions.
32%
Regression models that claim to reach causal conclusions, as fav...
This highlight has been truncated due to consecutive passage length restrictions.
32%
George Box has become famous for his brief but invaluable aphorism: ‘All models are wrong, some are useful.’
32%
But these cautions are easily forgotten. Once a model becomes accepted, and especially when it is out of the hands of those who created it and understand its limitations, then it can start acting as a sort of oracle.
32%
There are two broad tasks for such an algorithm:
- Classification (also known as discrimination or supervised learning): to say what kind of situation we’re facing. For example, the likes and dislikes of an online customer, or whether that object in a robot’s vision is a child or a dog.
- Prediction: to tell us what is going to happen. For example, what the weather will be next week, what a stock price might do tomorrow, what products that customer might buy, or whether that child is going to run out in front of our self-driving car.
Although these tasks differ in whether they are concerned with …
33%
The other way that data can be ‘big’ is by measuring many characteristics, or features, on each example. This quantity is often known as p, perhaps denoting parameters.
33%
One strategy for dealing with an excessive number of cases is to identify groups that are similar, a process known as clustering or unsupervised learning, since we have to learn about these groups and are not told in advance that they exist.
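For instance, a sketch of clustering with scikit-learn's k-means on invented two-feature data; the algorithm is never told the group memberships and has to discover them:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two hypothetical groups of similar cases, 100 each, with two features.
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(5, 1, (100, 2))])

# Unsupervised learning: no labels supplied, only the number of clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))  # roughly 100 cases recovered in each cluster
```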
33%
Before getting on with constructing an algorithm for classification or prediction, we may also have to reduce the raw data on each case to a manageable dimension due to an excessively large p, that is, too many features being measured on each case. This process is known as feature engineering.
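One common automatic data-reduction tool (one option among many, not necessarily what any particular team uses) is principal component analysis, sketched here on invented data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))  # 200 cases with an unwieldy p = 50 features

# Reduce the raw measurements to a handful of summary dimensions.
X_reduced = PCA(n_components=5).fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # (200, 50) -> (200, 5)
```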
33%
Recent developments in extremely complex models, such as those labelled as deep learning, suggest that this initial stage of data reduction may not be necessary and the total raw data can be processed in a single algorithm.
33%
A bewildering range of alternative methods are now readily available for building classification and prediction algorithms.
33%
A particularly popular competition, with thousands of competing teams, is to produce an algorithm for the following challenge: can we predict which passengers survived the sinking of the Titanic?
33%
Only around 700 of more than 2,200 passengers and crew on board got on to lifeboats and survived, and subsequent studies and fictional accounts have focused on the fact that your chances of getting on to a lifeboat and surviving crucially depended on what class of ticket you had.
34%
For the Analysis, it is crucial to split the data into a training set used to build the algorithm, and a test set that is kept apart and only used to assess performance – it would be serious cheating to look at the test set before we are ready with our algorithm.
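A sketch of the split with scikit-learn, on stand-in data (shapes and the 25% test fraction are arbitrary choices here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(1309, 5))  # stand-in features for 1,309 passengers
y = rng.integers(0, 2, 1309)    # stand-in survival labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
# Build the algorithm on the training set only; open the test set exactly
# once, at the very end, to assess performance.
print(len(X_train), len(X_test))
```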
34%
This is a real, and hence fairly messy, data set, and some pre-processing is required. Eighteen passengers have missing fare information, and they have been assumed to have paid the median fare for their class of travelling.
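In pandas that imputation might look like this (a toy frame, assuming columns named 'Pclass' and 'Fare'):

```python
import pandas as pd

# Toy slice of the passenger data with two missing fares.
df = pd.DataFrame({"Pclass": [1, 1, 3, 3, 3],
                   "Fare": [80.0, None, 7.75, 8.05, None]})

# Replace each missing fare with the median fare for that class of travel.
df["Fare"] = df.groupby("Pclass")["Fare"].transform(
    lambda s: s.fillna(s.median()))
print(df)
```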
34%
Suppose we made the (demonstrably incorrect) prediction that ‘Nobody survived’. Then, since 61% of the passengers died, we would get 61% right in the training set. If we used the slightly more complex prediction rule, ‘All women survive and no men survive’, we would correctly classify 78% of the training set. These naïve rules serve as good baselines from which to measure any improvements obtained from more sophisticated algorithms.
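Such baselines are one-liners once the labels are in arrays (toy data below; the 61% and 78% figures come from the real training set):

```python
import numpy as np

# Toy stand-ins: 1 = survived, plus an indicator for female passengers.
survived = np.array([0, 1, 1, 0, 0, 1, 1, 0, 0, 0])
is_female = np.array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1])

print((survived == 0).mean())          # accuracy of 'Nobody survived'
print((survived == is_female).mean())  # accuracy of 'Only women survive'
```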
34%
The classification tree shown in Figure 6.3 has an accuracy of 82% when applied to the training data on which it was developed. When the algorithm is applied to the test set the accuracy drops slightly to 81%. The numbers of the different types of errors made by the algorithm are shown in Table 6.1 – this is termed the error matrix, or sometimes the confusion matrix. If we are trying to detect survivors, the percentage of true survivors that are correctly predicted is known as the sensitivity of the algorithm, while the percentage of true non-survivors that are correctly predicted is known as the specificity.
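These quantities are straightforward to compute directly (toy predictions below; Table 6.1 holds the real counts):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = actually survived
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # the algorithm's classifications

true_pos = ((y_pred == 1) & (y_true == 1)).sum()
true_neg = ((y_pred == 0) & (y_true == 0)).sum()
sensitivity = true_pos / (y_true == 1).sum()  # survivors correctly predicted
specificity = true_neg / (y_true == 0).sum()  # non-survivors correctly predicted
print(sensitivity, specificity)
```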
34%
Algorithms that give a probability (or any number) rather than a simple classification are often compared using Receiver Operating Characteristic (ROC) curves, which were originally developed in the Second World War to analyse radar signals. The crucial insight is that we can vary the threshold at which people are predicted to survive.
35%
By considering all possible thresholds for predicting a survivor, the possible values for the specificity and sensitivity form a curve. Note that the specificity axis conventionally decreases from 1 to 0 when drawing an ROC curve.
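scikit-learn will enumerate those thresholds for us; a sketch with invented probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                   # 1 = survived
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])  # predicted P(survive)

# Each threshold yields one (sensitivity, specificity) pair on the curve.
fpr, sensitivity, thresholds = roc_curve(y_true, y_prob)
specificity = 1 - fpr  # plotted decreasing from 1 to 0 by convention
for t, se, sp in zip(thresholds, sensitivity, specificity):
    print(t, se, sp)
```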
35%
But how do we check how good these probabilities are? We cannot create a simple error matrix as in the classification tree, since the algorithm is never declaring categorically whether it will rain or not. We can create ROC curves, but these only examine whether days when it rains get higher predictions than when it doesn’t. The critical insight is that we also need calibration, in the sense that if we take all the days in which the forecaster says 70% chance of rain, then it really should rain on around 70% of those days. This is taken very seriously by weather forecasters – probabilities …
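A calibration check is just a grouped average (forecasts and outcomes invented):

```python
import numpy as np

# Hypothetical daily forecasts of P(rain) and what actually happened.
p_rain = np.array([0.7, 0.7, 0.7, 0.7, 0.7, 0.2, 0.2, 0.2, 0.2, 0.2])
rained = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0])

# Among all days given the same forecast, how often did it actually rain?
for p in np.unique(p_rain):
    observed = rained[p_rain == p].mean()
    print(f"forecast {p:.0%}: rained on {observed:.0%} of those days")
```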
36%
If we were predicting a numerical quantity, such as the temperature at noon tomorrow in a particular place, the accuracy would usually be summarized by the error – the difference between the observed and predicted temperature. The usual summary of the error over a number of days is the mean-squared-error (MSE) – this is the average of the squares of the errors, and is analogous to the least-squares criterion we saw used in regression analysis.
36%
The average mean-squared-error is known as the Brier score, after meteorologist Glenn Brier, who described the method in 1950.
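Computed on a set of probability forecasts (toy numbers), the score looks like this:

```python
import numpy as np

p_forecast = np.array([0.7, 0.2, 0.9, 0.4, 0.1])  # invented forecasts of P(rain)
outcome = np.array([1, 0, 1, 1, 0])               # 1 if it actually rained

# Brier score: the mean-squared-error of the probability forecasts.
brier = np.mean((p_forecast - outcome) ** 2)
print(brier)  # 0 would be perfect; smaller is better
```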
37%
Forecasters then create a ‘skill score’, which is the proportional reduction of the reference score: in our case 0.61, meaning our algorithm has made a 61% improvement on a naïve forecaster who uses only climate data.
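The skill score itself is one line; the two Brier scores below are invented for illustration and are not the book's values:

```python
brier_algorithm = 0.10  # hypothetical Brier score of our forecasts
brier_reference = 0.25  # hypothetical Brier score of a climate-only forecaster

# Proportional reduction in Brier score relative to the naive reference.
skill = 1 - brier_algorithm / brier_reference
print(skill)  # 0.6, i.e. a 60% improvement on the reference forecaster
```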
37%
Figure 6.6 shows such a tree, grown to include many detailed factors. This has an accuracy on the training set of 83%, better than the smaller tree. But when we apply this algorithm to the test data its accuracy drops to 81%, the same as the small tree, and its Brier score is 0.150, clearly worse than the simple tree’s 0.139. We have adapted the tree to the training data to such a degree that its predictive ability has started to decline.
37%
This is known as over-fitting, and is one of the most vital topics in algorithm construction.
37%
The rather strange set of questions suggests the tree has adapted too much to individual cases in the training set.
37%
Over-fitting therefore leads to less bias but at a cost of more uncertainty or variation in the estimates, which is why protection against over-fitting is sometimes known as the bias/variance trade-off.
37%
Techniques for avoiding over-fitting include regularization, in which complex models are encouraged but the effects of the variables are pulled in towards zero. But perhaps the most common protection is to use the simple but powerful idea of cross-validation when constructing the algorithm.
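A sketch of both ideas in scikit-learn on invented data: a penalized model pulls coefficients towards zero, and cross-validation estimates out-of-sample accuracy without ever opening the test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)

# Regularization: smaller C means a stronger pull of coefficients towards zero.
model = LogisticRegression(C=1.0)

# 5-fold cross-validation: fit on four folds, check accuracy on the held-out
# fifth, and rotate, so every case is used once for validation.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```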