It is essential to test any predictions on an independent test set that was not used in the training of the algorithm, but that only happens at the end of the development process. So although it might show up our over-fitting at that time, it does not build us a better algorithm. We can, however, mimic having an independent test set by removing say 10% of the training data, developing the algorithm on the remaining 90%, and testing on the removed 10%. This is cross-validation, and can be carried out systematically by removing 10% in turn and repeating the procedure ten times, a procedure known as tenfold cross-validation.
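As a rough sketch of the tenfold procedure just described – the logistic regression and the synthetic data are illustrative placeholders, not anything from the book:

```python
# A minimal sketch of tenfold cross-validation: develop on 90% of the data,
# test on the held-out 10%, and rotate through all ten folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"mean accuracy over the 10 held-out folds: {np.mean(scores):.3f}")
```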
For example, the standard procedure for building classification trees is to first construct a very deep tree with many branches that is deliberately over-fitted, and then prune the tree back to something simpler and more robust: this pruning is controlled by a complexity parameter.
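A hedged sketch of this grow-then-prune idea, using scikit-learn's cost-complexity pruning, where the ccp_alpha setting plays the role of the complexity parameter (synthetic data, purely for illustration):

```python
# Grow a deliberately over-fitted tree, then prune it back with a
# complexity parameter (cost-complexity pruning via ccp_alpha).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

deep = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=1).fit(X_train, y_train)

print("deep tree:  ", deep.get_n_leaves(), "leaves, test accuracy",
      round(deep.score(X_test, y_test), 3))
print("pruned tree:", pruned.get_n_leaves(), "leaves, test accuracy",
      round(pruned.score(X_test, y_test), 3))
```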
Classification trees and regression models arise from somewhat different modelling philosophies: trees attempt to construct simple rules that identify groups of cases with similar expected outcomes, while regression models focus on the weight to be given to specific features, regardless of what else is observed on a case.
Random forests comprise a large number of trees, each grown on a bootstrap resample of the data and producing its own classification, with the final classification decided by a majority vote – a process known as bagging (bootstrap aggregating).
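A toy sketch of the bagging-plus-majority-vote idea (scikit-learn's RandomForestClassifier does this, and also samples features at each split; this stripped-down version just shows the voting on synthetic data):

```python
# Toy random forest: many trees, each fitted to a bootstrap resample,
# with the final class decided by a majority vote across the trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
rng = np.random.default_rng(2)

trees = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))      # bootstrap resample of the cases
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees])     # one row of votes per tree
majority = (votes.mean(axis=0) > 0.5).astype(int)   # majority vote (binary labels)
print("training accuracy of the forest:", (majority == y).mean())
```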
Support vector machines try to find linear combinations of features that best split the different outcomes.
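A minimal linear-SVM sketch on synthetic data, just to show the fitted weights – the 'linear combination of features' referred to above:

```python
# A linear support vector machine: the fitted coefficients are the weighted
# combination of features used to split the two outcome classes.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=5, random_state=3)
svm = LinearSVC(max_iter=10000).fit(X, y)

print("feature weights:", svm.coef_.round(2))
print("intercept:      ", svm.intercept_.round(2))
```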
Neural networks comprise layers of nodes, each node depending on the previous layer by weights, rather like a series of logistic regressions.
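A small sketch along these lines, using logistic ('sigmoid') activations so that each node behaves much like a logistic regression feeding the next layer (synthetic data, illustrative settings):

```python
# A small neural network with logistic activations: each hidden node applies
# weights from the previous layer and squashes the result, much like a
# logistic regression whose output feeds the next layer.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=4)
net = MLPClassifier(hidden_layer_sizes=(8, 4),   # two layers of nodes
                    activation="logistic",
                    max_iter=2000,
                    random_state=4).fit(X, y)
print("training accuracy:", round(net.score(X, y), 3))
```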
Later, in Chapter 10, we shall check whether we can confidently claim there is a proper winner on any of these criteria, since the winning margins might be so small that they could be explained by chance variation – say in who happened to end up in the training and test sets. This reflects a general concern that algorithms that win Kaggle competitions tend to be very complex in order to achieve the tiny final margin needed to win.
Algorithms can display remarkable performance, but as their role in society increases so their potential problems become highlighted. Four main concerns can be identified.
Lack of robustness: Algorithms are derived from associations, and since they do not understand underlying processes, they can be overly sensitive to changes.
Not accounting for statistical variability: Automated rankings based on limited data will be unreliable. Teachers in the US have been ranked and penalized for the performance of their students in a single year, although class sizes of less than thirty do not provide a reliable basis for assessing the value added by a teacher.
When a vision algorithm was trained to discriminate pictures of huskies from German Shepherds, it was very effective until it failed on huskies that were kept as pets – it turned out that its apparent skill was based on identifying snow in the background.
life insurance cannot use race or any genetic information except Huntington's disease, and so on. But we can still get an idea of the influence of different factors by systematically lying and seeing how the quotation changes: this allows a certain degree of reverse-engineering of the algorithm to see what is driving the premium.
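A hypothetical sketch of that probing strategy – get_quote() stands in for an insurer's online quotation form and its made-up formula is purely for illustration; in practice each call would mean filling in the form with one answer changed:

```python
# Reverse-engineering a quotation by varying one answer at a time.
def get_quote(age, smoker, postcode):
    """Hypothetical stand-in for an insurer's quotation form (made-up formula)."""
    return 200 + 5 * age + (300 if smoker else 0)

baseline = dict(age=40, smoker=False, postcode="AB1 2CD")
base_premium = get_quote(**baseline)

for change in (dict(age=60), dict(smoker=True), dict(postcode="XY9 8ZW")):
    quote = get_quote(**{**baseline, **change})   # one answer changed at a time
    print(change, "shifts the premium by", quote - base_premium)
```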
was working on computer-aided diagnosis and handling uncertainty in AI in the 1980s, when much of the discourse was framed in terms of a competition between approaches based on probability and statistics, those based on encapsulating expert ‘rules’ of judgement or those trying to emulate cognitive capacities through neural networks. The field has now matured, with a more pragmatic and ecumenical approach to its underlying philosophy, although the hype has not gone away.
Systems such as Predict, which previously would be thought of as statistics-based decision-support systems, might now reasonably be called AI.fn7
Many of the challenges listed above come down to algorithms only modelling associations, and not having an idea of underlying causal processes.
We saw the idea of population summaries illustrated with the birth-weight data in Chapter 3, where we called the sample mean a statistic, and the population mean a parameter. In more technical statistical writing, these two figures are generally distinguished by giving them Roman and Greek letters respectively, in a possibly doomed attempt to avoid confusion; for example m often represents a sample mean, while the Greek μ (mu) is a population mean, and s generally represents a sample standard deviation, σ (sigma) a population standard deviation.
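In symbols (standard definitions, not quoted from the book), for observations x_1, …, x_n:

```latex
% Sample statistics (Roman letters) estimate population parameters (Greek letters)
m = \frac{1}{n}\sum_{i=1}^{n} x_i \quad\text{estimates}\quad \mu ,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - m)^2} \quad\text{estimates}\quad \sigma .
```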
Now we come to a critical step. In order to work out how accurate these statistics might be, we need to think of how much our statistics might change if we (in our imagination) were to repeat the sampling process many times. In other words, if we repeatedly drew samples of 760 men from the country, how much would the calculated statistics vary?
There are two ways to resolve this circularity. The first is to make some mathematical assumptions about the shape of the population distribution, and use sophisticated probability theory to work out the variability we would expect in our estimate, and hence how far away we might expect, say, the average of our sample to be from the mean of the population. This is the traditional method that is taught in statistics textbooks, and we shall see how this works in Chapter
However, there is an alternative approach, based on the plausible assumption that the population should look roughly like the sample. Since we cannot repeatedly draw a new sample from the population, we instead repeatedly draw new samples from our sample!
We therefore get an idea of how our estimate varies through this process of resampling with replacement. This is known as bootstrapping the data – the magical idea of pulling oneself up by one’s own bootstraps is reflected in this ability to learn about the variability in an estimate without having to make any assumptions about the shape of the population distribution.
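A minimal sketch of that resampling-with-replacement idea, applied to the mean of a simulated skewed sample of 760 values (the data themselves are illustrative; 760 simply echoes the sample size mentioned above):

```python
# The bootstrap: resample the observed data *with replacement*, recompute
# the statistic each time, and look at how much it varies.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=760)    # illustrative skewed sample

boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(1000)]

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean {sample.mean():.2f}, 95% bootstrap interval ({lo:.2f}, {hi:.2f})")
```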
Figure 7.3 displays some clear features. The first, and perhaps most notable, is that almost all trace of the skewness of the original samples has gone – the distributions of the estimates based on the resampled data are almost symmetric around the mean of the original data. This is a first glimpse of what is known as the Central Limit Theorem, which says that the distribution of sample means tends towards the form of a normal distribution with increasing sample size, almost regardless of the shape of the original data distribution. This is an exceptional result, which we shall explore further.
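A quick simulation of that tendency, drawing samples of increasing size from a deliberately skewed (exponential) population and measuring the skewness of the resulting sample means (numbers purely illustrative):

```python
# Central Limit Theorem check: as the sample size grows, the distribution
# of sample means from a skewed population becomes nearly symmetric.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
for n in (5, 50, 500):
    means = rng.exponential(scale=2.0, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:4d}: skewness of the sample means = {skew(means):.2f}")
```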
This is repeated as many times as desired: for illustration, Figure 7.4 shows the fitted lines arising from just twenty resamples in order to demonstrate the scatter of lines. It is clear that, since the original data set is large, there is relatively little variability in the fitted lines and, when based on 1,000 bootstrap resamples, a 95% interval for the gradient runs from 0.22 to 0.44.
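The same resampling idea applied to a fitted line, on synthetic data (the book's data set and its 0.22–0.44 interval are not reproduced here; the 'true' gradient of 0.33 below is just an assumed value for illustration):

```python
# Bootstrapping a regression gradient: resample the (x, y) pairs with
# replacement, refit the line each time, and take percentiles of the slope.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
y = 0.33 * x + rng.normal(size=n)                # assumed gradient, for illustration

slopes = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)             # bootstrap resample of the cases
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    slopes.append(slope)

print("95% bootstrap interval for the gradient:",
      np.percentile(slopes, [2.5, 97.5]).round(3))
```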
Uncertainty intervals based on bootstrapping take advantage of modern computer power, do not require assumptions about the mathematical form of the population and do not require complex probability theory.
But why do we need to use probability theory when doing statistics?
I am often asked why people tend to find probability a difficult and unintuitive idea, and I reply that, after forty years researching and teaching in this area, I have finally concluded that it is because probability really is a difficult and unintuitive idea.
But an alternative would be to use a more intuitive idea – expected frequencies – which has been shown in numerous psychology experiments to improve people's reasoning about probability.
they serve to emphasize that all the probabilities we use are conditional – there is no such thing as the unconditional probability of an event; there are always assumptions and other factors that could affect the probability.
This exercise in conditional probability helps us to understand a very counter-intuitive result: in spite of the ‘90% accuracy’ of the scan, the vast majority of women with a positive mammogram do not have breast cancer. It is easy to confuse the probability of a positive test, given cancer, with the probability of cancer, given a positive test.
It is an easy mistake to make, but the logic is as faulty as going from the statement ‘if you’re the Pope, then you’re a Catholic’ to ‘if you’re a Catholic, then you’re the Pope’, where the flaw is somewhat simpler to spot.
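As a rough illustration of the sum behind that counter-intuitive result, here is the expected-frequencies calculation with assumed round figures (roughly 1 in 100 of the women screened has cancer, the scan picks up 90% of cancers, and 10% of women without cancer receive a false positive – the book's exact numbers may differ):

```python
# Conditional probability with expected frequencies: out of 1,000 women
# screened (assumed figures: ~1% prevalence, 90% sensitivity, 90% specificity).
women = 1000
with_cancer = 10                                 # 1% of 1,000
true_positives = 0.9 * with_cancer               # 90% of cancers picked up   ->  9
false_positives = 0.1 * (women - with_cancer)    # 10% of the 990 without     -> 99

p_cancer_given_positive = true_positives / (true_positives + false_positives)
print(f"P(cancer | positive mammogram) ≈ {p_cancer_given_positive:.2f}")   # about 0.08
```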
Don’t expect a neat consensus from the ‘experts’. They may agree on the mathematics of probability, but philosophers and statisticians have come up with all sorts of different ideas for what these elusive numbers actually mean, and argue intensively over them. Some popular suggestions include:
Classical probability: This is what we are taught in school, based on the symmetries of coins, dice, packs of cards, and so on, and can be defined as, 'The number of outcomes favouring the event divided by the total number of possible outcomes, assuming the outcomes are all equally likely.' For example, the probability of throwing a 'one' on a balanced die is 1/6, since there are six equally likely faces. But this definition is somewhat circular, as we need to have a definition of 'equally likely'.
'Enumerative' probability:fn4 Suppose there are three white socks and four black socks in a drawer, and we take a sock at random: what is the probability that it is white?
‘Long-run frequency’ probability: This is based on the proportion of times an event occurs in an infinite sequence of identical experiments, exactly as we found when we simulated the Chevalier’s games.
Propensity or ‘chance’: This is the idea that there is some objective tendency of the situation to produce an event.
Subjective or 'personal' probability: This is a specific person's judgement about a specific occasion, based on their current knowledge.
Different ‘experts’ have their own preference among these alternatives, but personally I prefer the final interpretation – subjective probability. This means I take the view that any numerical probability is essentially constructed according to what is known in the current situation – indeed probability doesn’t really ‘exist’ at all (except possibly at the subatomic level). This approach forms the basis for the Bayesian school of statistical inference, which we will explore in detail in Chapter 11.
We now come to the crucial but difficult stage of laying out the general connection between probability theory, data and learning about whatever target population we are interested in.
Probability theory naturally comes into play in what we shall call situation 1: When the data-point can be considered to be generated by some randomizing device, for example when throwing dice, flipping coins, or randomly allocating an individual to a medical treatment using a pseudo-random-number generator, and then recording the outcomes of their treatment.
But in practice we may be faced with situation 2: When a pre-existing data-point is chosen by a randomizing device, say when selecting people at random to take part in a survey.
And much of the time our data arises from situation 3: When there is no randomness at all, but we act as if the data-point were in fact generated by some random process, for example in ...
In Chapter 3 we discussed the idea of a metaphorical population, comprising the possible eventualities that might have occurred, but mainly didn’t. We now need to brace ourselves for an apparently irrational step: we need to act as if data were generated by a random mechanism from this population, even though we know full well that it was not.
But what is the justification for building a probability distribution? The number of homicides recorded each day in a country is simply a fact – there has been no sampling, and there is no explicit random element generating each unfortunate event. Just an immensely complex and unpredictable world. But whatever our personal philosophy behind luck or fortune, it turns out that it is useful to act as if these events were produced by some random process driven by probability.
Data of this kind can be represented as observations from a Poisson distribution, which was originally developed by Siméon Denis Poisson in France in the 1830s to represent the pattern of wrongful convictions per year. Since then it has been used to model everything from the number of goals scored by a football team in a match or the number of winning lottery tickets each week, to the number of Prussian officers kicked to death by their horses each year. In each of these situations there is a very large number of opportunities for an event to happen, but each with a very low chance of occurring.
Whereas the normal (or Gaussian) distribution in Chapter 3 required two parameters – the population mean and standard deviation – the Poisson distribution depends only on its mean.
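In symbols (the standard form, not quoted from the book): if the mean number of events is μ, the Poisson probability of observing exactly k events is

```latex
% Poisson distribution: a single parameter, the mean \mu
P(X = k) = \frac{e^{-\mu}\,\mu^{k}}{k!}, \qquad k = 0, 1, 2, \dots
% the variance of X also equals \mu
```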
Figure 8.5 compares the expected distribution of the daily number of homicide incidents based on a Poisson assumption, and the actual empirical data distribution over these 1,095 days – the match is very close indeed, and in Chapter 10 I will show how to test formally whether the Poisson assumption is justified.
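A sketch of the kind of comparison behind Figure 8.5: given the observed average number of homicide incidents per day, the Poisson assumption yields an expected number of days (out of the 1,095) with 0, 1, 2, … incidents, to be set against the empirical counts. The daily mean of 1.5 used below is an assumed placeholder, not the actual recorded figure:

```python
# Expected number of days with k homicide incidents under a Poisson assumption.
from scipy.stats import poisson

days, daily_mean = 1095, 1.5                     # daily_mean is illustrative only
for k in range(7):
    print(f"{k} incidents: expected on {days * poisson.pmf(k, daily_mean):6.1f} days")
print(f"7+ incidents: expected on {days * poisson.sf(6, daily_mean):6.1f} days")
```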
We have therefore established that probability forms the appropriate mathematical foundation for both ‘pure’ randomness, which occurs with subatomic particles, coins, dice, and so on; and ‘natural’, unavoidable variability, such as in birth weights, survival after surgery, examination results, homicides, and every other phenomenon that is not totally predictable.
In the next chapter we come to a truly remarkable development in the history of human understanding: how these two aspects of probability can be brought together to provide a rigorous basis for formal statistical inference.
In the last chapter we discussed the idea of a random variable – a single data-point drawn from a probability distribution described by parameters. But we are seldom interested in just one data-point – we generally have a mass of data which we summarize by determining means, medians and other statistics. The fundamental step we will take in this chapter is to consider those statistics as themselves being random variables, drawn from their own distributions.
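To make that concrete with a standard result (previewed here; assuming independent observations from a population with mean μ and standard deviation σ), the sample mean of n observations is itself a random variable with

```latex
% Sampling distribution of the mean of n independent observations
\mathbb{E}[\bar{X}_n] = \mu ,
\qquad
\operatorname{sd}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}} .
```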
And given the discussion of the bootstrap in Chapter 7, it would be reasonable to ask why we need all that mathematics, when we can work out uncertainty intervals and so on using simulation-based bootstrap approaches.