Kindle Notes & Highlights
By fitting our line so closely to the points, we’re just modelling the random noise that exists in our dataset. The model overfits the data.
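A minimal sketch of that idea, assuming nothing beyond NumPy (the data and the degree-9 polynomial are my own illustration, not the book's example): fit both a straight line and a wildly flexible curve to noisy data, then check each against a fresh sample drawn from the same process.

```python
# A minimal sketch (not from the book): fit a straight line and a 9th-degree
# polynomial to noisy data generated from a simple linear trend, then compare
# how each does on fresh data drawn from the same process.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
true_y = 2 * x + 1                               # the real signal
y_train = true_y + rng.normal(0, 0.5, x.size)    # signal + noise
y_test = true_y + rng.normal(0, 0.5, x.size)     # new noise, same signal

for degree in (1, 9):
    coefs = np.polyfit(x, y_train, degree)
    pred = np.polyval(coefs, x)
    train_err = np.mean((pred - y_train) ** 2)
    test_err = np.mean((pred - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")

# The 9th-degree fit hugs the training points (low train error) but typically
# does worse on the new sample: it has modelled the noise, not the signal.
```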
This is what scientists are unwittingly doing when they p-hack: they’re making a big deal of what is often just random noise, counting it as part of the model instead of as nuisance variation that should be disregarded in favour of the real signal
The researchers called what happened between dissertation and journal publication ‘the chrysalis effect’. By the time they reached their final publication, initially ugly sets of findings had often metamorphosed into handsome butterflies, with messy-looking, non-significant results dropped or altered in favour of a clear, positive narrative.75 In most cases, the students probably thought that by nixing such results they were letting their data more clearly ‘tell a story’
datapoints – interesting, to be sure, but not central to the study. What happens if you don’t find your desired statistically significant result for height, but you do find a significant difference between men and women in, say, their amount of TV watching? Outcome-switching is when you decide to present the study as if it had always been about TV time.
But when a lucrative career rests on the truth of a certain theory, a scientist gains a new motivation in their day job: to publish only studies that support that theory (or p-hack them until they do). This is a financial conflict of interest like any other and one aggravated by the extra reputational concerns.
least, it might feel that way if you have the wrong attitude to science. The currency of positive, statistically significant results in science is so strong that many researchers forget that null results matter too. To know that a treatment doesn’t work, or that a disease isn’t related to some bio-marker, is useful information: it means we might want to spend our time and money elsewhere in future.
biases might affect the practice of the science itself.103 For example, neuroscientists working with animals such as mice have a tendency to study only the male of the species on which they’re experimenting. This is because females are often thought to be more affected by hormonal fluctuations – a hard-to-control source of variability in the animals’ brains and emotions, which might compromise the results.
Trying to correct for bias in science by injecting an equal and opposite dose of bias only compounds the problem, and potentially invites a vicious cycle of ever-increasing division between different ideological camps.
The paper had said average growth above the 90 per cent ratio was −0.1 per cent; after the corrections it was +2.2 per cent. There was nothing magic about that 90 per cent number after all; growth didn’t suddenly turn negative after that threshold. In reality, there was ‘a wide range of GDP growth performances at every level of public debt’.7
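A hypothetical illustration of how that kind of spreadsheet slip plays out (the numbers below are invented, not the actual Reinhart–Rogoff data): leaving a few rows out of an averaging range can flip the sign of a group mean.

```python
# Hypothetical illustration (not the real Reinhart-Rogoff data): averaging
# growth for high-debt country-years, once over the full set of rows and once
# with the last few rows accidentally left out of the range.
import numpy as np

growth_high_debt = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 3.5, 4.0, 4.5])  # made-up values

full_mean = growth_high_debt.mean()           # all rows included
truncated_mean = growth_high_debt[:5].mean()  # last rows dropped, Excel-range style

print(f"mean with all rows:     {full_mean:+.1f}%")
print(f"mean with rows omitted: {truncated_mean:+.1f}%")
# Leaving out a handful of rows is enough to flip the sign of the average.
```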
This chapter is concerned with two kinds of scientific negligence. The first is what we just encountered: unforced errors that are introduced to scientific analyses by inattention, oversight or carelessness. The second kind is when scientists, who should know better, bake errors into the very way their studies are designed. This latter kind of mistake could be due to poor training, apathy, forgetfulness or, much as it seems cruel to say, sheer incompetence.
mistakes were minor and made little difference to the overall results. But some of the inconsistencies had a big effect on the study conclusions: 13 per cent had a Reinhart-Rogoff-style serious mistake that might have completely changed the interpretation of their results
The inconsistencies that statcheck flagged tended to be in the authors’ favour – that is, mistaken numbers tended to make the results more, rather than less, likely to fit with the study’s hypothesis. If these were just entirely random typos, we wouldn’t expect them to go in any particular direction on average. But as we might have predicted from what we know about bias, it seems as though the scientists were more likely to take a second look when the results didn’t go their way.
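The core of what statcheck does can be sketched in a few lines: recompute the p-value implied by a reported test statistic and its degrees of freedom, then compare it with the p-value the authors printed. The snippet below is a simplified consistency check with made-up 'reported' numbers, not a reimplementation of statcheck's actual rules.

```python
# Minimal sketch of the statcheck idea (hypothetical reported numbers): recompute
# a p-value from the test statistic and degrees of freedom reported in a paper,
# and compare it with the p-value the authors reported.
from scipy import stats

reported = {"t": 2.2, "df": 28, "reported_p": 0.01}   # e.g. "t(28) = 2.20, p = .01"

recomputed_p = 2 * stats.t.sf(abs(reported["t"]), df=reported["df"])  # two-tailed
print(f"recomputed p = {recomputed_p:.3f}, reported p = {reported['reported_p']}")

if abs(recomputed_p - reported["reported_p"]) > 0.005:
    print("inconsistent: the reported p-value doesn't match the test statistic")
```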
The story reminds us once more that even ‘classic’ findings from the scientific literature – the ones that you would hope had been examined most rigorously – can be wholly unreliable, with what should be the most important part, the numbers and the data, acting as mere window-dressing in service of an attention-grabbing story.
A study in 2006 found that a paltry 26 per cent of psychologists were willing to send their data to other researchers upon an email request, and similarly dismal figures come from other fields. You’re also much less likely to be able to access the data the older a study gets.22 This reluctance to share data is a block on the vital processes of self-scrutiny – those Mertonian norms of communalism and organised scepticism again – that lie at the heart of science.
A 2017 analysis which scoured the literature for studies using known misidentified cell lines found an astonishing 32,755 papers that used so-called impostor cells, and over 500,000 papers that cited those contaminated studies.
Large-scale reviews have also found that underpowered research is rife in medical trials, biomedical research more generally, economics, brain imaging, nursing research, behavioural ecology and – quelle surprise – psychology.
since underpowered studies only have the power to detect large effects, those are the only effects they see. This is where the logic leads. If you find an effect in an underpowered study, that effect is probably exaggerated.
Then comes publication bias: since large effects are exciting effects, they’re much more likely to go on to get published.
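Those two points can be put together in a small simulation (my sketch, not the book's): give a modest true effect to lots of underpowered studies, let only the statistically significant ones through the publication filter, and look at the effect sizes that survive.

```python
# Simulation sketch: a modest true effect studied with small samples. Only the
# studies that happen to reach statistical significance 'get published', and
# their average effect size is noticeably larger than the true one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 0.2      # true difference between groups, in standard-deviation units
n_per_group = 20       # a small, underpowered sample
n_studies = 5000

published_effects = []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    t, p = stats.ttest_ind(treated, control)
    if p < 0.05:                                   # the publication filter
        published_effects.append(treated.mean() - control.mean())

print(f"true effect: {true_effect}")
print(f"mean 'published' effect: {np.mean(published_effects):.2f}")
print(f"share of studies reaching significance: {len(published_effects) / n_studies:.0%}")
```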
the conclusion of the new, high-power GWASs was that, with a few very rare exceptions, complex human traits are generally related to many thousands of genetic variants, each of which appears to contribute only a minuscule effect.59 There was no space for large effects of single genes, which was completely at odds with the results of all those previously lauded candidate gene studies. Since then, efforts that specifically tried to replicate the candidate gene studies with high statistical power have produced flat-as-a-pancake null results for IQ test scores, depression and schizophrenia.
Ronald Fisher, the statistician who popularised the p-value and the idea of ‘statistical significance’, worked out that complex traits must be massively polygenic – that is, must be related to many thousands of small-effect genes – as far back as 1918.63
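A toy version of that polygenic picture (not Fisher's 1918 derivation, just an illustration with invented parameters): build a trait out of thousands of variants with tiny effects and see how little of the variance any single variant explains.

```python
# Toy polygenic model: a trait built from thousands of variants, each with a
# tiny effect, gives a smooth distribution in which no single variant matters much.
import numpy as np

rng = np.random.default_rng(2)
n_people, n_variants = 2_000, 2_000

genotypes = rng.binomial(2, 0.5, size=(n_people, n_variants))  # 0/1/2 copies of each variant
effects = rng.normal(0, 1, n_variants) / np.sqrt(n_variants)   # each effect is minuscule
trait = genotypes @ effects + rng.normal(0, 1, n_people)       # genetics + environment

# Variance in the trait explained by the single largest-effect variant:
top = np.argmax(np.abs(effects))
r2 = np.corrcoef(genotypes[:, top], trait)[0, 1] ** 2
print(f"trait variance explained by the biggest single variant: {r2:.4%}")
```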
When I’ve made comments about low statistical power in scientific seminars, I’ve often heard replies along the lines of ‘my students need to publish papers to be competitive on the job market, and they can’t afford to do large-scale studies. They need to make do with what they have.’ This is a prime example of well-intentioned individual scientists being systematically encouraged – some would argue, forced – to accept compromises that ultimately render their work unscientific.
Their propensity to mislead means that low-powered studies actively subtract from our knowledge: it would often have been better never to have done them in the first place. Scientists who knowingly run low-powered research, and the reviewers and editors who wave through tiny studies for publication, are introducing a subtle poison into the scientific literature, weakening the evidence that it needs to progress.
When the press release exaggerated the claim first, similar exaggeration in the media was 6.5 times more likely for advice claims, 20 times more likely for causal claims, and a whopping 56 times more likely for translational claims.
Another trial from 2019 filled in the next part of the story: hyped health news stories really did make readers more likely to believe a treatment was beneficial.27
A study in 2017 found that only around 50 per cent of health studies covered in the media are eventually confirmed by meta-analyses (that is, 50 per cent are found to be broadly replicable).
Letting the facts slide in favour of a good story risks a race to the bottom, with science books getting published that are ever more inaccurate and ever more divorced from the data. When these books are inevitably debunked, or the lifestyle changes they recommend fail to live up to the hype, damage is done to the reputation of science more generally.
The steep rise in positive-sounding phrases in scientific journals tells us that hype isn’t just restricted to press releases and popular-science books: it has seeped into the way scientists write their papers. In the scientific community, this species of hype is often referred to with a term borrowed from politics: spin.
The hypothesis was that the mice with microbiomes from autistic people would choose the object over the mouse companion – but they showed no difference. As the science writer Jon Brock noted in a detailed critique of the study, the authors quickly skipped over this inconvenient result in a single sentence, whereas all the results that turned out to be significant were featured in a full-colour graph in the paper.84
Hyping and spinning such a tiny, preliminary study would have been bad enough, but it gets worse. When the biostatistician Thomas Lumley obtained the authors’ data and attempted to reproduce their analysis, he found that they’d fouled up their statistical tests.
nutritional epidemiology is hard. An incredibly complex physiological and mental machinery is involved in the way we process food and decide what to eat; observational data are subject to enormous noise and the vagaries of human memory; randomised trials can be tripped up by the complexities of their own administration. Given that context, the sheer amount of media interest in nutritional research is particularly unfortunate. Perhaps the very scientific questions that the public wants to have answered the most – what to eat, how to educate children, how to talk to potential employers, and so …
What we’ll find is that the scientific incentive system engenders an obsession not just with certain kinds of papers, but with publication itself. The system incentivises scientists not to practise science, but simply to meet its own perverse demands.
In the academic job market, hiring and promotion decisions are based in no small part on how many publications you have on your CV, and in which journals they’re published. Publish too few papers, and publish them in too-obscure outlets, and you’ve a far lower chance of getting or retaining a job.
One analysis showed that in the five years following publication, approximately 12 per cent of medical research papers and around 30 per cent of natural- and social-science papers had zero citations.
once you begin to chase the numbers themselves rather than the principles that they stand for – in this case, the principle of finding research that makes a big contribution to our knowledge – you’ve completely lost your way.
technology could also help keep those errors from happening in the first place. In recent years, software has been developed that combines statistical analysis and word processing into one program, automatically populating all the relevant tables and figures within the paper.
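The snippet below is a bare-bones sketch of the idea behind such tools, not any particular product: compute the statistics in code and write them directly into the manuscript text, so no number is ever transcribed by hand.

```python
# Minimal sketch of analysis-plus-write-up in one place: the statistics are
# computed and inserted into the report text programmatically, so the numbers
# in the manuscript can never drift out of sync with the analysis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(0.0, 1.0, 40)
treated = rng.normal(0.3, 1.0, 40)
t, p = stats.ttest_ind(treated, control)

paragraph = (
    f"The treated group (M = {treated.mean():.2f}) scored higher than the "
    f"control group (M = {control.mean():.2f}), t({len(treated) + len(control) - 2}) "
    f"= {t:.2f}, p = {p:.3f}."
)
with open("results_section.txt", "w") as f:
    f.write(paragraph)   # regenerate the text whenever the data or analysis change
print(paragraph)
```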
To paraphrase the biologist Ottoline Leyser, the point of breaking ground is to begin to build something; if all you do is groundbreaking, you end up with a lot of holes in the ground but no buildings.
what is needed is less focus on statistical significance – a p-value below the arbitrary threshold of 0.05 – and more on practical significance. In a study with a large enough sample size (and high enough statistical power), even very small effects – for example, a pill reducing headache symptoms by one per cent of one point on our 1–5 pain scale – can come up as statistically significant, often with p-values far below 0.05, though they could be essentially useless in absolute terms.
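That pill example can be made concrete with a quick simulation (made-up numbers, deliberately huge sample): a 0.01-point improvement on the 1–5 scale comes out as overwhelmingly 'significant'.

```python
# Illustration of statistical vs practical significance (made-up numbers): with
# a big enough sample, a practically negligible improvement on a 1-5 pain scale
# produces a tiny p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 2_000_000                                  # an absurdly large trial
placebo = rng.normal(3.0, 1.0, n)              # mean pain 3.0 on the 1-5 scale
pill = rng.normal(3.0 - 0.01, 1.0, n)          # improvement of 0.01 points (1% of one point)

t, p = stats.ttest_ind(placebo, pill)
print(f"difference in means: {placebo.mean() - pill.mean():.3f} points")
print(f"p-value: {p:.2e}")   # typically far below 0.05, despite a uselessly small effect
```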
Statistics alone cannot solve the underlying problem: the crooked timber of human nature and, by extension, that of the scientific system. No matter which statistical perspective became dominant, some scientists would find ways to game it to make their results look more impressive than they really are.
precondition for publication in most medical journals since 2005.41 Registering a study involves posting a public, time-stamped document online that details what the researchers are planning to do, in advance of collecting any data. A public repository of experiments that are about to be run provides a baseline by which to check what proportion of these studies actually make it to publication.
Pre-registration lets you be clear with your readers whether you were using the data in an exploratory way, to generate hypotheses (‘huh, interesting, variable X seems to be linked to variable Y! We’d better check whether this replicates in a new dataset’), or in a confirmatory way, to nail them down (‘I predicted that variable X would be related to variable Y in this dataset, and sure enough, it is!’).
Another option is to use an even more rigorous version of pre-registration. In this scenario, a scientist submits the registration itself to peer review and, if it’s approved and the reviewers agree that the study design is sound, the journal commits to publishing the eventual paper no matter how its results come out.
Open Science is the idea that as far as possible, every part of the scientific process should be made freely accessible.52 The perfect Open Science study would have an associated webpage where you could download all its data, all the statistical code that the scientists used to analyse those data, and all the materials they used to gather the data in the first place.
The peer reviews and previous drafts of the article could be published alongside the article (even if the identity of the reviewers isn’t revealed), allowing the reader to see the whole publication process.54
Mostly, though, open data acts as a deterrent against committing fraud in the first place, since it would take the brassiest of brass necks to post a fake dataset on a public website.55 The same principle works for p-hacking that borders on fraud, and for more innocent errors: allowing other scientists to see your data and how you analysed it means that eagle-eyed peers can spot spreadsheet typos, incorrect statistics, improbable numbers, or undeclared analyses that might affect the way your results should be interpreted.
The perverse incentives have become so deeply embedded that they’ve created a system that’s self-sustaining. Years of implicit and explicit training to chase publications and citations at any cost leave their mark on trainee scientists, forming new norms, habits, and ways of thinking that are hard to break even once a stable job has been secured. And as we discussed in the previous chapter, the system creates a selection pressure where the only academics who survive are the ones who are naturally good at playing the game.
What’s required, then, is a new way of apportioning credit and responsibility in science: a system in which scientists are rewarded for their contributions, rather than for merely having their name on a paper.
When making hiring and tenure decisions they should consider ‘good scientific citizenship’ in addition to, or even in place of, measures such as a scientist’s h-index.95 They should recognise not just publication but the complexity of building international collaborations, the arduousness of collecting and sharing data, the honesty of publishing every study whether null or otherwise, and the unglamorous but necessary process of running replication studies. They should, in other words, reward researchers for working towards a more open and transparent scientific literature, expanding the range …
‘five selfish reasons to work reproducibly’:
Scientists think more like Boulez than Górecki. They’ve taken excitement about novelty in science too far, producing a perverse neophilia where every study needs to be a massive breakthrough that revolutionises the way we think about the world
But although unexpected breakthroughs do occur from time to time, most science is incremental and cumulative, building slowly towards tentative theories rather than suddenly leaping to conclusive truths.117 Most science, to be perfectly honest, is quite boring.