Kindle Notes & Highlights
Read between January 12, 2021 and December 8, 2023
The first rule of communication is to shut up and listen, so that you can get to know about the audience for your communication, whether it might be politicians, professionals or the general public. We have to understand their inevitable limitations and any misunderstandings, and fight the temptation to be too sophisticated and clever, or put in too much detail.
The second rule of communication is to know what you want to achieve.
tall fathers tend to have sons who are slightly shorter than them, while shorter fathers have slightly taller sons.
regression to the mean
taller mothers tend to have daughters who are shorter than them, and shorter mothers tend to have taller daughters.
any process of fitting lines or curves to data came to be called 'regression'.
In basic regression analysis the dependent variable is the quantity that we want to predict or explain, usually forming the vertical y-axis of a graph – this is sometimes known as the response variable – while the independent variable is the quantity that we use for doing the prediction or explanation, usually forming the horizontal x-axis of the graph.
The US Federal Reserve define a model as a 'representation of some aspect of the world which is based on simplifying assumptions'.
'error' does not refer to a mistake, but to the inevitable inability of a model to exactly represent what we observe.
observation = deterministic model + residual error.
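As a minimal sketch of this decomposition (using made-up father/son heights rather than any data from the book), the fitted line plays the role of the deterministic model and the residuals are whatever is left over:

```python
# Sketch of observation = deterministic model + residual error,
# using synthetic father/son heights (illustrative numbers only).
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
father = rng.normal(175, 7, size=500)                      # independent variable (x)
son = 0.5 * (father - 175) + 175 + rng.normal(0, 6, 500)   # dependent variable (y)

fit = linregress(father, son)
predicted = fit.intercept + fit.slope * father             # deterministic part
residual = son - predicted                                 # residual error

# Each observation is exactly the model's prediction plus its residual.
assert np.allclose(son, predicted + residual)
print(f"slope = {fit.slope:.2f}")   # a slope below 1 reflects regression to the mean
```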
form of regression has been developed for proportions, called logistic regression, which ensures a curve which cannot go above 100% or below 0%.
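A minimal sketch of the idea using scikit-learn's LogisticRegression on synthetic data (not a dataset from the book): the fitted values are probabilities, so they can never stray above 100% or below 0%.

```python
# Logistic regression keeps predictions between 0% and 100%.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(300, 1))          # a single explanatory feature
p_true = 1 / (1 + np.exp(-2 * x[:, 0]))        # true underlying proportion
y = rng.binomial(1, p_true)                    # observed 0/1 outcomes

model = LogisticRegression().fit(x, y)
probs = model.predict_proba(x)[:, 1]           # fitted probabilities
print(probs.min(), probs.max())                # always strictly between 0 and 1
```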
four main modelling strategies have been adopted by different communities of researchers:
- Rather simple mathematical representations for associations, such as the linear regression analyses in this chapter, which tend to be favoured by statisticians.
- Complex deterministic models based on scientific understanding of a physical process, such as those used in weather forecasting, which are intended to realistically represent underlying mechanisms, and which are generally developed by applied mathematicians.
- Complex algorithms used to make a decision or prediction that have been derived from an …
George Box has become famous for his brief but invaluable aphorism: ‘All models are wrong, some are useful.’
The financial crisis of 2007–2008 has to a large extent been blamed on the exaggerated trust placed in complex financial models used to determine the risk of, say, bundles of mortgages. These models assumed only a moderate correlation between mortgage failures, and worked well while the property market was booming. But when conditions changed and mortgages started failing, they tended to fail in droves: the models grossly underestimated the risks due to the correlations turning out to be far higher than supposed. Senior managers simply did not realize the frail basis on which these models were …
to take a set of observations relevant to a current situation, and map them to a relevant conclusion. This process has been termed predictive analytics, but we are verging into the territory of artificial intelligence (AI), in which algorithms embodied in machines are used either to carry out tasks that would normally require human involvement, or to provide expert-level advice to humans.
One strategy for dealing with an excessive number of cases is to identify groups that are similar, a process known as clustering or unsupervised learning, since we have to learn about these groups and are not told in advance that they exist.
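As an illustrative sketch only (k-means is one clustering method among many, not necessarily the one the book has in mind), here is what unsupervised grouping looks like on synthetic data:

```python
# Unsupervised learning: group similar cases without being told the groups exist.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=2)   # synthetic cases
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Each case is assigned to one of the discovered groups.
print(np.bincount(labels))   # sizes of the three clusters
```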
to reduce the raw data on each case to a manageable dimension due to excessively large p, that is too many features being measured on each case. This process is known as feature engineering.
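Principal components analysis is one common way of carrying out this kind of data reduction; the sketch below is a generic illustration rather than the book's specific procedure.

```python
# Reducing excessively many features (large p) to a few summary dimensions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=50, random_state=3)
X_reduced = PCA(n_components=5).fit_transform(X)   # 50 features -> 5 components
print(X.shape, "->", X_reduced.shape)              # (200, 50) -> (200, 5)
```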
Recent developments in extremely complex models, such as those labelled as deep learning, suggest that this initial stage of data reduction may not be necessary and the total raw data can be processed in a single algorithm.
classification tree is perhaps the simplest form of algorithm, since it consists of a series of yes/no questions, the answer to each deciding the next question to be asked, until a conclusion is reached.
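A minimal sketch of such a tree on synthetic data (a stand-in, not the book's Titanic example): the printed output is literally the sequence of yes/no questions, with the leaves giving the conclusions.

```python
# A classification tree is a series of yes/no questions ending in a conclusion.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=4)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each line below is one yes/no question; leaves give the predicted class.
print(export_text(tree, feature_names=["f0", "f1", "f2", "f3"]))
```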
Algorithms that give a probability (or any number) rather than a simple classification are often compared using Receiver Operating Characteristic (ROC) curves, which were originally developed in the Second World War to analyse radar signals. The crucial insight is that we can vary the threshold at which people are predicted to survive.
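A sketch of how an ROC curve is traced out by sweeping the threshold, using synthetic outcomes and predicted probabilities (illustrative numbers only):

```python
# ROC curve: sweep the probability threshold and record the trade-off
# between true-positive rate and false-positive rate.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(5)
y_true = rng.binomial(1, 0.4, size=1000)                            # actual outcomes
scores = np.clip(0.4 * y_true + rng.normal(0.3, 0.2, 1000), 0, 1)   # model's probabilities

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(f"Area under the ROC curve: {roc_auc_score(y_true, scores):.2f}")
# Each (fpr, tpr) pair corresponds to one choice of threshold.
```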
weather forecasts are based on extremely complex computer models which encapsulate detailed mathematical formulae representing how weather develops from current conditions, and each run of the model produces a deterministic yes/no prediction of rain at a particular place and time. So to produce a probabilistic forecast, the model has to be run many times starting at slightly adjusted initial conditions, which produces a list of different 'possible futures', in some of which it rains and in some it doesn't. Forecasters run an 'ensemble' of, say, fifty models, and if it rains in five of those …
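As a toy illustration of the ensemble idea (the rain/no-rain outcomes of the fifty runs are simply simulated here, standing in for genuine model runs), the forecast probability is just the fraction of runs in which it rains:

```python
# Toy ensemble forecast: report the fraction of runs in which it rains.
import numpy as np

rng = np.random.default_rng(6)
ensemble_rain = rng.random(50) < 0.1      # 50 runs; True where that run produces rain
print(f"Forecast probability of rain: {ensemble_rain.mean():.0%}")
```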
The critical insight is that we also need calibration, in the sense that if we take all the days in which the forecaster says 70% chance of rain, then it really should rain on around 70% of those days. This is taken very seriously by weather forecasters – probabilities should mean what they say, and not be either over- or under-confident.
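A sketch of a calibration check on simulated forecasts; scikit-learn's calibration_curve is one convenient way to do the binning, though any grouping of forecast days would do:

```python
# Calibration check: among days given a ~70% forecast, it should rain ~70% of the time.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(7)
forecast = rng.uniform(0, 1, size=5000)                  # stated probabilities of rain
rained = (rng.random(5000) < forecast).astype(int)       # outcomes simulated to be well calibrated

observed_freq, mean_forecast = calibration_curve(rained, forecast, n_bins=10)
for f, o in zip(mean_forecast, observed_freq):
    print(f"forecast ~{f:.0%}  ->  rained on {o:.0%} of those days")
```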
If we were predicting a numerical quantity, such as the temperature at noon tomorrow in a particular place, the accuracy would usually be summarized by the error – the difference between the observed and predicted temperature. The usual summary of the error over a number of days is the mean-squared-error (MSE) – this is the average of the squares of the errors, and is analogous to the least-squares criterion we saw used in regression analysis.
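A small worked example with made-up temperatures, just to fix the definition:

```python
# Mean-squared-error: average of the squared differences between observed and predicted.
import numpy as np

observed  = np.array([12.1, 14.3, 11.8, 15.0])   # observed noon temperatures (made-up)
predicted = np.array([11.5, 14.0, 13.0, 15.5])   # forecast temperatures (made-up)

errors = observed - predicted
mse = np.mean(errors ** 2)
print(f"mean-squared-error = {mse:.2f}")
```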
The average mean-squared-error is known as the Brier score, after meteorologist Glenn Brier, who described the method in 1950.
When it comes to the Titanic challenge, consider the naïve algorithm of just giving everyone a 39% probability of surviving, which is the overall proportion of survivors in the training set. This does not use any individual data, and is essentially equivalent to predicting weather using climate records rather than information on the current circumstances. The Brier score for this 'skill-less' rule is 0.232. In contrast, the Brier score for the simple classification tree is 0.139, which is a 40% reduction from the naïve prediction, and so demonstrates considerable skill. Another way of …
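A sketch of the same comparison on synthetic outcomes (a stand-in for the Titanic data, so the numbers will not reproduce 0.232 and 0.139): the Brier score is simply the mean-squared-error applied to probability forecasts of a 0/1 outcome.

```python
# Brier score: mean-squared-error between probability forecasts and 0/1 outcomes.
import numpy as np

rng = np.random.default_rng(8)
survived = rng.binomial(1, 0.39, size=800)        # synthetic 0/1 outcomes, ~39% positive

def brier(probs):
    return np.mean((probs - survived) ** 2)

naive = np.full(800, survived.mean())             # 'skill-less' rule: same probability for everyone
informed = np.clip(0.15 + 0.7 * survived + rng.normal(0, 0.1, 800), 0, 1)  # a model with some skill

print(f"naive Brier score:    {brier(naive):.3f}")
print(f"informed Brier score: {brier(informed):.3f}   (lower is better)")
```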
We have adapted the tree to the training data to such a degree that its predictive ability has started to decline. This is known as over-fitting, and is one of the most vital topics in algorithm construction. By making an algorithm too complex, we essentially start fitting the noise rather than the signal.
Over-fitting therefore leads to less bias but at a cost of more uncertainty or variation in the estimates, which is why protection against over-fitting is sometimes known as the bias/variance trade-off.
mimic having an independent test set by removing say 10% of the training data, developing the algorithm on the remaining 90%, and testing on the removed 10%. This is cross-validation, and can be carried out systematically by removing 10% in turn and repeating the procedure ten times, a procedure known as tenfold cross-validation.
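A sketch of tenfold cross-validation in scikit-learn on synthetic data; cross_val_score performs the 'hold out 10%, fit on the remaining 90%, test on the held-out slice' loop ten times:

```python
# Tenfold cross-validation: ten rounds of fitting on 90% and testing on the held-out 10%.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=9)
folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=folds)

print([round(s, 2) for s in scores])   # accuracy on each of the ten held-out slices
print(f"mean accuracy: {scores.mean():.2f}")
```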
the standard procedure for building classification trees is to first construct a very deep tree with many branches that is deliberately over-fitted, and then prune the tree back to something simpler and more robust: this pruning is controlled by a complexity parameter.
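In scikit-learn the pruning strength is controlled by the cost-complexity parameter ccp_alpha; this is one library's version of the idea, not necessarily the exact procedure the book describes:

```python
# Grow a deliberately over-fitted tree, then prune it back with a complexity parameter.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=10)

deep = DecisionTreeClassifier(random_state=0).fit(X, y)     # fully grown, over-fitted
path = deep.cost_complexity_pruning_path(X, y)              # candidate pruning strengths
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]          # a middling value; in practice chosen by cross-validation
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)

print("leaves before pruning:", deep.get_n_leaves())
print("leaves after pruning: ", pruned.get_n_leaves())
```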
more sophisticated regression approaches are available for dealing with large and complex problems, such as non-linear models and a process known as the LASSO, which simultaneously estimates coefficients and selects relevant predictor variables, essentially by estimating the coefficients of the irrelevant ones to be zero.
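A sketch with scikit-learn's Lasso on synthetic data in which only three of twenty predictors genuinely matter; most of the irrelevant predictors are driven to coefficients of exactly zero:

```python
# The LASSO estimates coefficients and drops irrelevant predictors by setting them to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=5.0, random_state=11)
model = Lasso(alpha=5.0).fit(X, y)

print("non-zero coefficients:", np.sum(model.coef_ != 0), "out of", len(model.coef_))
```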
Classification trees and regression models arise from somewhat different modelling philosophies: trees attempt to construct simple rules that identify groups of cases with similar expected outcomes, while regression models focus on the weight to be given to specific features, regardless of what else is observed on a case.
Random forests comprise a large number of trees, each fitted to a bootstrap resample of the data and producing its own classification, with the final classification decided by a majority vote across the trees – a process known as bagging (short for bootstrap aggregating).
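A sketch using scikit-learn's RandomForestClassifier on synthetic data (illustrative settings, not the book's):

```python
# A random forest: many trees, each fitted to a bootstrap resample of the data,
# with the final classification decided by majority vote across the trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=12)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print("number of trees:", len(forest.estimators_))
print("forest's vote for the first case:", forest.predict(X[:1])[0])
```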
Support vector machines try to find linear combinations of features that best split the different outcomes.
Neural networks comprise layers of nodes, each node depending on the previous layer by weights, rather like a series of logistic regressions piled on top of each other. Weights are learned by an optimization procedure, and, rather like random forests, multiple neural networks can be constructed and averaged. Neural networks with many layers have become known as deep-learning models: Google's Inception …
K-nearest-neighbour classifies according to the majority outcome among close cases.
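A minimal side-by-side sketch of these three families of algorithm in scikit-learn, fitted to the same synthetic classification problem (the particular settings are illustrative assumptions):

```python
# Support vector machine, neural network and k-nearest-neighbour on the same data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=8, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "support vector machine": SVC(kernel="linear"),              # linear combination of features
    "neural network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "k-nearest-neighbour": KNeighborsClassifier(n_neighbors=5),  # majority vote among close cases
}
for name, model in models.items():
    accuracy = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: test accuracy {accuracy:.2f}")
```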
major problem is that these algorithms tend to be inscrutable black boxes – they come up with a prediction, but it is almost impossible to work out what is going on inside.
This has three negative aspects. First, extreme complexity makes implementation and upgrading a great effort: when Netflix offered a $1m prize for improved recommendation systems, the winning algorithm was so complicated that Netflix ended up not using it.
The second negative feature is that we do not know how the conclusion was arrived at, or what confidence we should have in it: we just have to take it or leave it. Sim...
Finally, if we do not know how an algorithm is producing its answer, we cannot investigate it for implicit but systematic biases ...
The sample size should affect your confidence in the estimate, and knowing exactly how much difference it makes is a basic necessity for proper statistical inference.
bootstrapping the data – the magical idea of pulling oneself up by one’s own bootstraps
sampling distributions of estimates, since they reflect the variability in estimates that arise from repeated sampling of data.
first, and perhaps most notable, is that almost all trace of the skewness of the original samples has gone – the distributions of the estimates based on the resampled data are almost symmetric around the mean of the original data. This is a first glimpse of what is known as the Central Limit Theorem, which says that the distribution of sample means tends towards the form of a normal distribution with increasing sample size, almost regardless of the shape of the original data distribution.
bootstrap distributions get narrower as the sample size increases, which is reflected in the steadily narrower 95% uncertainty intervals.
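A sketch of the idea on a deliberately skewed synthetic sample (not the book's data): the bootstrap distribution of the mean comes out roughly symmetric despite the skewness, and the 95% intervals narrow as the sample size grows.

```python
# Bootstrap sampling distributions of the mean: resample the observed data with
# replacement many times, and watch the distribution of the estimate.
import numpy as np

rng = np.random.default_rng(14)
for n in (50, 200, 1000):                            # increasing original sample sizes
    sample = rng.exponential(scale=10, size=n)       # a skewed original sample
    boot_means = np.array([
        rng.choice(sample, size=n, replace=True).mean()
        for _ in range(2000)
    ])
    lo, hi = np.percentile(boot_means, [2.5, 97.5])  # 95% bootstrap uncertainty interval
    print(f"n={n:4d}  mean={sample.mean():5.1f}  95% interval=({lo:.1f}, {hi:.1f})")
# The bootstrap distributions are roughly symmetric despite the skewed data
# (the Central Limit Theorem at work), and the intervals narrow as n grows.
```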
to emphasize that all the probabilities we use are conditional – there is no such thing as the unconditional probability of an event; there are always assumptions and other factors that could affect the probability. And, as we now see, we need to be careful about what we condition on.
any numerical probability is essentially constructed according to what is known in the current situation – indeed probability doesn’t really ‘exist’ at all (except possibly at the subatomic level). This approach forms the basis for the Bayesian school of statistical inference,
The implications of probability are not intuitive, but insights can be improved by using the idea of expected frequencies.
Many social phenomena show a remarkable regularity in their overall pattern, while individual events are entirely unpredictable.
to go from a single sample back to saying something about a possible population. This is the process of inductive inference
Suppose I have a coin, and I ask you for your probability that it will come up heads. You happily answer '50:50', or similar. Then I flip it, cover up the result before either of us sees it, and again ask for your probability that it is heads. If you are typical of my experience, you may, after a pause, rather grudgingly say '50:50'. Then I take a quick look at the coin, without showing you, and repeat the question. Again, if you are like most people, you eventually mumble '50:50'. This simple exercise reveals a major distinction between two types of uncertainty: what is known as aleatory uncertainty …