The Art of Statistics: Learning from Data
Read between January 12, 2021 - December 8, 2023
18%
The first rule of communication is to shut up and listen, so that you can get to know about the audience for your communication, whether it might be politicians, professionals or the general public. We have to understand their inevitable limitations and any misunderstandings, and fight the temptation to be too sophisticated and clever, or put in too much detail.
19%
The second rule of communication is to know what you want to achieve.
29%
tall fathers tend to have sons who are slightly shorter than them, while shorter fathers have slightly taller sons.
29%
regression to the mean
29%
taller mothers tend to have daughters who are shorter than them, and shorter mothers tend to have taller daughters.
29%
any process of fitting lines or curves to data came to be called ‘regression’.
29%
In basic regression analysis the dependent variable is the quantity that we want to predict or explain, usually forming the vertical y-axis of a graph – this is sometimes known as the response variable – while the independent variable is the quantity that we use for doing the prediction or explanation, usually forming the horizontal x-axis.
30%
The US Federal Reserve define a model as a ‘representation of some aspect of the world which is based on simplifying assumptions’:
30%
‘error’ does not refer to a mistake, but to the inevitable inability of a model to exactly represent what we observe.
30%
observation = deterministic model + residual error.
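A minimal numerical sketch of this decomposition, using simulated father and son heights and a straight line fitted by least squares (nothing here comes from the book's data):

```python
# Sketch: observation = deterministic model + residual error, on simulated heights.
import numpy as np

rng = np.random.default_rng(0)
father = rng.normal(175, 7, size=200)                 # hypothetical father heights (cm)
son = 86 + 0.5 * father + rng.normal(0, 5, size=200)  # simulated son heights

slope, intercept = np.polyfit(father, son, 1)         # deterministic (straight-line) model
predicted = intercept + slope * father
residual = son - predicted                            # residual error

# Every observation is exactly recovered as model prediction plus residual.
assert np.allclose(son, predicted + residual)
print(f"fitted line: son = {intercept:.1f} + {slope:.2f} * father")
```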
31%
form of regression has been developed for proportions, called logistic regression, which ensures a curve which cannot go above 100% or below 0%.
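A brief sketch of that bounded curve, fitting a logistic regression to invented data with scikit-learn (used here purely as an illustration):

```python
# Sketch: logistic regression keeps fitted proportions strictly between 0 and 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=500).reshape(-1, 1)
p_true = 1 / (1 + np.exp(-(x.ravel() - 5)))   # true underlying proportion
y = rng.binomial(1, p_true)                   # observed yes/no outcomes

model = LogisticRegression().fit(x, y)
grid = np.linspace(-5, 15, 9).reshape(-1, 1)  # even well outside the observed range...
print(np.round(model.predict_proba(grid)[:, 1], 3))  # ...predictions stay within (0, 1)
```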
32%
four main modelling strategies have been adopted by different communities of researchers:
- Rather simple mathematical representations for associations, such as the linear regression analyses in this chapter, which tend to be favoured by statisticians.
- Complex deterministic models based on scientific understanding of a physical process, such as those used in weather forecasting, which are intended to realistically represent underlying mechanisms, and which are generally developed by applied mathematicians.
- Complex algorithms used to make a decision or prediction that have been derived from an ...more
32%
George Box has become famous for his brief but invaluable aphorism: ‘All models are wrong, some are useful.’
32%
The financial crisis of 2007–2008 has to a large extent been blamed on the exaggerated trust placed in complex financial models used to determine the risk of, say, bundles of mortgages. These models assumed only a moderate correlation between mortgage failures, and worked well while the property market was booming. But when conditions changed and mortgages started failing, they tended to fail in droves: the models grossly underestimated the risks due to the correlations turning out to be far higher than supposed. Senior managers simply did not realize the frail basis on which these models were ...more
32%
to take a set of observations relevant to a current situation, and map them to a relevant conclusion. This process has been termed predictive analytics, but we are verging into the territory of artificial intelligence (AI), in which algorithms embodied in machines are used either to carry out tasks that would normally require human involvement, or to provide expert-level advice to humans.
33%
One strategy for dealing with an excessive number of cases is to identify groups that are similar, a process known as clustering or unsupervised learning, since we have to learn about these groups and are not told in advance that they exist.
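A small illustration of clustering on made-up data; k-means is used as one possible method, not as anything the book prescribes:

```python
# Sketch: unsupervised clustering proposes groups without being told they exist.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three hypothetical groups of cases, generated here but never labelled for the algorithm
data = np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0, 5, 10)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(np.bincount(kmeans.labels_))  # roughly 100 cases assigned to each discovered group
```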
33%
to reduce the raw data on each case to a manageable dimension due to excessively large p, that is, too many features being measured on each case. This process is known as feature engineering.
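One common way of carrying out this kind of reduction (a stand-in, not the book's method) is principal component analysis; the sketch below uses invented data with 50 measured features driven by only two underlying dimensions:

```python
# Sketch: reducing an excessively large number of features to a manageable dimension.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n_cases, p_features = 200, 50
signal = rng.normal(size=(n_cases, 2))               # the real structure is 2-dimensional
raw = signal @ rng.normal(size=(2, p_features))
raw += 0.1 * rng.normal(size=(n_cases, p_features))  # measurement noise

reduced = PCA(n_components=2).fit_transform(raw)     # 50 features summarized by 2
print(raw.shape, "->", reduced.shape)                # (200, 50) -> (200, 2)
```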
33%
Recent developments in extremely complex models, such as those labelled as deep learning, suggest that this initial stage of data reduction may not be necessary and the total raw data can be processed in a single algorithm.
34%
classification tree is perhaps the simplest form of algorithm, since it consists of a series of yes/no questions, the answer to each deciding the next question to be asked, until a conclusion is reached.
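A sketch of such a tree on invented Titanic-style features (the variable names and survival process below are made up, not the book's data), printed as its sequence of yes/no questions:

```python
# Sketch: a shallow classification tree is just a series of yes/no questions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
n = 1000
is_female = rng.integers(0, 2, n)
third_class = rng.integers(0, 2, n)
p_survive = np.clip(0.2 + 0.5 * is_female - 0.1 * third_class, 0, 1)  # invented process
survived = rng.binomial(1, p_survive)

X = np.column_stack([is_female, third_class])
tree = DecisionTreeClassifier(max_depth=2).fit(X, survived)
print(export_text(tree, feature_names=["is_female", "third_class"]))
```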
34%
Algorithms that give a probability (or any number) rather than a simple classification are often compared using Receiver Operating Characteristic (ROC) curves, which were originally developed in the Second World War to analyse radar signals. The crucial insight is that we can vary the threshold at which people are predicted to survive.
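A short sketch of how an ROC curve arises from sweeping the classification threshold across predicted probabilities, on simulated outcomes:

```python
# Sketch: each threshold gives one (false-positive rate, true-positive rate) point.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(5)
outcome = rng.integers(0, 2, 500)                               # simulated survival (0/1)
prob = np.clip(0.5 * outcome + rng.uniform(0, 0.8, 500), 0, 1)  # noisy predicted probability

fpr, tpr, thresholds = roc_curve(outcome, prob)  # the curve, traced threshold by threshold
print("area under ROC curve:", round(roc_auc_score(outcome, prob), 2))
```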
35%
weather forecasts are based on extremely complex computer models which encapsulate detailed mathematical formulae representing how weather develops from current conditions, and each run of the model produces a deterministic yes/no prediction of rain at a particular place and time. So to produce a probabilistic forecast, the model has to be run many times starting at slightly adjusted initial conditions, which produces a list of different ‘possible futures’, in some of which it rains and in some it doesn’t. Forecasters run an ‘ensemble’ of, say, fifty models, and if it rains in five of those ...more
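A toy sketch of the ensemble idea, with a trivial deterministic rule standing in for a real forecasting model and invented starting conditions:

```python
# Sketch: run a deterministic model many times from perturbed initial conditions,
# then report the fraction of runs in which it rains as the forecast probability.
import numpy as np

def toy_model(humidity, pressure):
    """Deterministic yes/no 'rain' rule; a stand-in for a full weather model."""
    return humidity > 0.6 and pressure < 1010

rng = np.random.default_rng(6)
runs = 50
rain_runs = sum(
    toy_model(0.62 + rng.normal(0, 0.05), 1008 + rng.normal(0, 3))
    for _ in range(runs)
)
print(f"chance of rain: {100 * rain_runs / runs:.0f}%")
```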
35%
The critical insight is that we also need calibration, in the sense that if we take all the days in which the forecaster says 70% chance of rain, then it really should rain on around 70% of those days. This is taken very seriously by weather forecasters – probabilities should mean what they say, and not be either over- or under-confident.
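A small simulated check of calibration along these lines, using hypothetical forecasts generated to be well calibrated:

```python
# Sketch: among days given about a 70% chance of rain, it should rain about 70% of the time.
import numpy as np

rng = np.random.default_rng(7)
forecast = rng.uniform(0, 1, 10_000)   # hypothetical forecast probabilities
rained = rng.binomial(1, forecast)     # outcomes consistent with those forecasts

band = (forecast > 0.65) & (forecast < 0.75)  # the days called roughly "70%"
print(f"rained on {100 * rained[band].mean():.0f}% of the ~70% days")
```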
36%
If we were predicting a numerical quantity, such as the temperature at noon tomorrow in a particular place, the accuracy would usually be summarized by the error – the difference between the observed and predicted temperature. The usual summary of the error over a number of days is the mean-squared-error (MSE) – this is the average of the squares of the errors, and is analogous to the least-squares criterion we saw used in regression analysis.
36%
When the forecasts are probabilities for a yes/no event, the mean-squared-error is known as the Brier score, after meteorologist Glenn Brier, who described the method in 1950.
37%
When it comes to the Titanic challenge, consider the naïve algorithm of just giving everyone a 39% probability of surviving, which is the overall proportion of survivors in the training set. This does not use any individual data, and is essentially equivalent to predicting weather using climate records rather than information on the current circumstances. The Brier score for this ‘skill-less’ rule is 0.232. In contrast, the Brier score for the simple classification tree is 0.139, which is a 40% reduction from the naïve prediction, and so demonstrates considerable skill. Another way of ...more
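A sketch of the Brier score as the mean squared difference between forecast probabilities and the 0/1 outcomes, together with the 40% reduction quoted above; the three-day mini-example is invented, not the Titanic data:

```python
# Sketch: Brier score and the 'skill' arithmetic from the passage.
import numpy as np

def brier(prob, outcome):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return np.mean((np.asarray(prob) - np.asarray(outcome)) ** 2)

# e.g. always forecasting 39% for three hypothetical outcomes 1, 0, 0:
print(round(brier([0.39, 0.39, 0.39], [1, 0, 0]), 3))

naive, tree = 0.232, 0.139  # scores quoted in the passage
print(f"reduction: {100 * (naive - tree) / naive:.0f}%")  # about 40%
```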
37%
We have adapted the tree to the training data to such a degree that its predictive ability has started to decline. This is known as over-fitting, and is one of the most vital topics in algorithm construction. By making an algorithm too complex, we essentially start fitting the noise rather than the signal.
37%
Over-fitting therefore leads to less bias but at a cost of more uncertainty or variation in the estimates, which is why protection against over-fitting is sometimes known as the bias/variance trade-off
37%
mimic having an independent test set by removing say 10% of the training data, developing the algorithm on the remaining 90%, and testing on the removed 10%. This is cross-validation, and can be carried out systematically by removing 10% in turn and repeating the procedure ten times, a procedure known as tenfold cross-validation.
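A sketch of tenfold cross-validation on synthetic data, with scikit-learn handling the hold-out-each-10%-in-turn bookkeeping:

```python
# Sketch: fit on 90%, test on the held-out 10%, repeat ten times, average the results.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 500) > 0).astype(int)

scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=10)
print(np.round(scores, 2), "mean accuracy:", round(scores.mean(), 2))
```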
37%
the standard procedure for building classification trees is to first construct a very deep tree with many branches that is deliberately over-fitted, and then prune the tree back to something simpler and more robust: this pruning is controlled by a complexity parameter.
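A sketch of the grow-then-prune idea, using scikit-learn's ccp_alpha as one concrete complexity parameter (the data and the chosen value are invented; in practice the parameter would be tuned, for example by cross-validation):

```python
# Sketch: grow a deliberately deep tree, then prune it back with a complexity parameter.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(9)
X = rng.normal(size=(600, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(0, 1, 600) > 0).astype(int)

deep = DecisionTreeClassifier(random_state=0).fit(X, y)                    # over-fitted
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)  # pruned back
print("leaves before pruning:", deep.get_n_leaves(), "after:", pruned.get_n_leaves())
```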
38%
more sophisticated regression approaches are available for dealing with large and complex problems, such as non-linear models and a process known as the LASSO, that simultaneously estimates coefficients and selects relevant predictor variables, essentially by estimating their coefficients to be zero.
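A sketch of the LASSO's simultaneous estimation and selection, on synthetic data in which only two of ten predictors actually matter (the penalty value is arbitrary):

```python
# Sketch: the LASSO shrinks coefficients of irrelevant predictors exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(10)
X = rng.normal(size=(300, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 300)  # only the first two predictors matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(lasso.coef_, 2))  # most coefficients come out as exactly 0.0
```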
38%
Classification trees and regression models arise from somewhat different modelling philosophies: trees attempt to construct simple rules that identify groups of cases with similar expected outcomes, while regression models focus on the weight to be given to specific features, regardless of what else is observed on a case.
38%
Random forests comprise a large number of trees, each grown on a bootstrap resample of the data and producing a classification, with the final classification decided by a majority vote; this combination of bootstrap resampling and aggregation is known as bagging.
38%
Support vector machines try to find linear combinations of features that best split the different outcomes.
38%
Neural networks comprise layers of nodes, each node depending on the previous layer by weights, rather like a series of logistic regressions piled on top of each other. Weights are learned by an optimization procedure, and, rather like random forests, multiple neural networks can be constructed and averaged. Neural networks with many layers have become known as deep-learning models: Google’s Incept...
38%
K-nearest-neighbour classifies according to the majority outcome among close cases.
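A sketch fitting the methods listed above to a single synthetic problem, purely to show the shared fit-and-evaluate pattern; it is not a meaningful comparison of their merits:

```python
# Sketch: random forest, support vector machine, neural network and k-nearest-neighbour
# applied to the same invented classification task.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
X = rng.normal(size=(400, 4))
y = (X[:, 0] * X[:, 1] + rng.normal(0, 0.5, 400) > 0).astype(int)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "support vector machine": SVC(),
    "neural network": MLPClassifier(max_iter=2000, random_state=0),
    "k-nearest-neighbour": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.2f}")
```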
38%
major problem is that these algorithms tend to be inscrutable black boxes – they come up with a prediction, but it is almost impossible to work out what is going on inside.
38%
This has three negative aspects. First, extreme complexity makes implementation and upgrading a great effort: when Netflix offered a $1m prize for improving its recommendation system, the winning entry was so complicated that Netflix ended up not using it.
38%
The second negative feature is that we do not know how the conclusion was arrived at, or what confidence we should have in it: we just have to take it or leave it. Sim...
38%
Finally, if we do not know how an algorithm is producing its answer, we cannot investigate it for implicit but systematic biases ...
41%
The sample size should affect your confidence in the estimate, and knowing exactly how much difference it makes is a basic necessity for proper statistical inference.
42%
bootstrapping the data – the magical idea of pulling oneself up by one’s own bootstraps
42%
sampling distributions of estimates, since they reflect the variability in estimates that arise from repeated sampling of data.
42%
first, and perhaps most notable, is that almost all trace of the skewness of the original samples has gone – the distributions of the estimates based on the resampled data are almost symmetric around the mean of the original data. This is a first glimpse of what is known as the Central Limit Theorem, which says that the distribution of sample means tends towards the form of a normal distribution with increasing sample size, almost regardless of the shape of the original data distribution.
42%
bootstrap distributions get narrower as the sample size increases, which is reflected in the steadily narrower 95% uncertainty intervals.
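A sketch of bootstrapping a skewed sample at several sample sizes, showing the roughly normal shape of the resampled means and the narrowing 95% intervals (all data simulated):

```python
# Sketch: resample the data with replacement, recompute the mean each time,
# and read off a 95% uncertainty interval from the bootstrap distribution.
import numpy as np

rng = np.random.default_rng(12)
for n in (50, 200, 800):
    sample = rng.exponential(scale=10, size=n)           # skewed original data
    boot_means = np.array([
        rng.choice(sample, size=n, replace=True).mean()  # one bootstrap resample
        for _ in range(2000)
    ])
    low, high = np.percentile(boot_means, [2.5, 97.5])
    print(f"n={n}: 95% interval for the mean = ({low:.1f}, {high:.1f})")
```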
44%
to emphasize that all the probabilities we use are conditional – there is no such thing as the unconditional probability of an event; there are always assumptions and other factors that could affect the probability. And, as we now see, we need to be careful about what we condition on.
45%
any numerical probability is essentially constructed according to what is known in the current situation – indeed probability doesn’t really ‘exist’ at all (except possibly at the subatomic level). This approach forms the basis for the Bayesian school of statistical inference,
47%
The implications of probability are not intuitive, but insights can be improved by using the idea of expected frequencies.
47%
Many social phenomena show a remarkable regularity in their overall pattern, while individual events are entirely unpredictable.
48%
to go from a single sample back to saying something about a possible population. This is the process of inductive inference
49%
Suppose I have a coin, and I ask you for your probability that it will come up heads. You happily answer ‘50:50’, or similar. Then I flip it, cover up the result before either of us sees it, and again ask for your probability that it is heads. If you are typical of my experience, you may, after a pause, rather grudgingly say ‘50:50’. Then I take a quick look at the coin, without showing you, and repeat the question. Again, if you are like most people, you eventually mumble ‘50:50’. This simple exercise reveals a major distinction between two types of uncertainty: what is known as aleatory ...more