Kindle Notes & Highlights
Read between March 12 - June 25, 2022
However – and here’s where it gets strange – when one of the pictures was pornographic, the students tended to choose it slightly more often than chance: 53.1 per cent of the time, to be exact. This met the threshold for ‘statistical significance’.
The editor rejected the paper within a few days, explaining to us that they had a policy of never publishing studies that repeated a previous experiment, whether or not those studies found the same results as the original.
Science, the discipline in which we should find the harshest scepticism, the most pin-sharp rationality and the hardest-headed empiricism, has become home to a dizzying array of incompetence, delusion, lies and self-deception. In the process, the central purpose of science – to find our way ever closer to the truth – is being undermined.
Science is inherently a social thing, where you have to convince other people – other scientists – of what you’ve found. And since science is also a human thing, we know that any scientists will be prone to human characteristics, such as irrationality, biases, lapses in attention, in-group favouritism and outright cheating to get what they want.
A peculiar complacency, a strange arrogance, has taken hold, where the mere existence of the peer-review system seems to have stopped us from recognising its flaws. Peer-reviewed papers are supposedly as near as one can get to an objective factual account of how the world works.
All we can hope for is that our scientific studies are trustworthy – that they honestly report what occurred in the research. If the much-vaunted peer-review process can’t justify that trust, science loses one of its most basic and most desirable qualities, along with its ability to do what it does best: revolutionise our world with a steady progression of new discoveries, technologies, treatments and cures.
nearly all of these problems have been uncovered by other scientists.
In economics, a miserable 0.1 per cent of all articles published were attempted replications of prior results; in psychology, the number was better, but still nowhere near good, with an attempted replication rate of just over 1 per cent.
When new scientists run the statistics in their own way, the results come out differently. Those studies are like a cookbook including mouth-watering photographs of meals but providing only the patchiest details of the ingredients and recipe needed to produce them.
A 2005 article by the meta-scientist John Ioannidis carried the dramatic title ‘Why Most Published Research Findings Are False’, and its mathematical model concluded just that: once you consider the many ways that scientific studies can go wrong, any given claim in a scientific paper is more likely to be false than true.70 The article may have attracted a great deal of attention and discussion, being cited over 800 times in the first five years after its publication, but in terms of scientists starting to make the necessary changes to improve the quality of research, its Cassandra-like warnings largely went unheeded.
The authors had omitted a devastating detail: the patient in question had died – seven weeks before the paper was even accepted for publication.8 The patient from the second operation had died even earlier, just three months after his procedure.9 The third Karolinska patient would die in 2017 after several failed follow-up surgeries.
They got together and complained to the heads of the Karolinska Institute. Instead of surprise and concern, they were met with stonewalling and attempts to have them silenced. The Institute even reported the doctors to the police, alleging that by looking through the patients’ medical records they had violated their privacy.
there are some wider lessons to learn from his story. The first is how much of science, despite its built-in organised scepticism, comes down to trust: trust that the studies really occurred as reported, that the numbers really are what came out of the statistical analysis, and, in this case, that the patients really did recover in the way that was claimed. Fraud shows just how badly that trust can be exploited.
Whistleblowers from Hwang’s lab revealed that only two cell lines had been created, not eleven, and neither was from cloned embryos.44 The rest of the cell photos had been doctored or deliberately mislabelled under Hwang’s instructions.
the system is largely built on trust: everyone basically assumes ethical behaviour on the part of everyone else. Unfortunately, that’s exactly the sort of environment where fraud can thrive – where fakers, like parasites, can free-ride off the collective goodwill of the community.
Southern blots detect DNA sequences: using radioactive tagging, the blotting produces the images you’ve perhaps seen accompanying news articles on genetics, with semi-rectangular blurs of varying size organised into vertical ladders or ‘lanes’.
western blots can be used to diagnose some diseases by picking up on the production of proteins that indicate the presence of certain bacteria or viruses. Scientists often proudly display the blots as figures in their papers, with the blot providing the crucial evidence that some chemical has been detected in the experiment.
Each one was supposed to come from the same blot, but if you looked very closely, the background of one was darker than the others, with suspiciously sharp edges. It turned out that it had been spliced in from another, separate blot picture, and subtly resized to fit better with the other lanes.
A more detailed investigation revealed that Obokata’s rap sheet went beyond the initial charge of fabricating images: she’d also included figures from older research that she rebranded as new, and faked data that showed how quickly the cells had grown. Any actual evidence of pluripotency was due to her allowing her samples to become contaminated with embryonic stem cells.60
Perhaps most intriguing, though, were the results about repeat offenders: when Bik and her team found a paper with faked images, they checked to see if other publications by the same author also had image duplications. This was true just under 40 per cent of the time. Duplicating one image may be regarded as carelessness; duplicating two looks like fraud.
If a dataset looks too neat, too tidily similar across different groups, something strange might be afoot. As the geneticist J. B. S. Haldane put it, ‘man is an orderly animal’ who ‘finds it very hard to imitate the disorder of nature’, and that goes for fraudsters as much as for the rest of us.
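As a rough illustration of Haldane’s point – not taken from any real case – here is a Python sketch comparing the spread of genuinely random sample means with a set of invented, suspiciously tidy numbers.

```python
# Sketch: real sampling noise is lumpier than intuition suggests.
# All numbers below are invented for illustration.
import numpy as np

rng = np.random.default_rng(seed=6)

# Means of five genuinely random samples (n=20) from the same population.
real_means = [rng.normal(loc=50, scale=10, size=20).mean() for _ in range(5)]

# What a careless fabricator might type in: values that all look 'about right'.
faked_means = [50.1, 49.9, 50.2, 50.0, 49.8]

print("Spread of real means: ", round(float(np.std(real_means)), 2))
print("Spread of faked means:", round(float(np.std(faked_means)), 2))
# Genuine data varies between groups far more than the neat, fabricated set --
# the kind of 'too tidy' pattern that can raise suspicion.
```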
what LaCour, like Sanna, Smeesters, and Stapel before him, got from his data fabrication was control. The study fit the exact specifications required to convince Science’s peer reviewers it was worth publishing. It was what the publication system and the university job market appeared to demand: not a snapshot of a messy reality where results can be unclear and interpretations uncertain, but a clean, impactful result that could immediately translate to use in the real world.
the Retraction Watch Database shows that just 2 per cent of individual scientists are responsible for 25 per cent of all retractions.86 The worst repeat offenders are added to the Retraction Watch Leaderboard, a position on which is a sort of reverse Nobel Prize.
The undisputed heavyweight champion of retractions, however, is currently the Japanese anaesthesiologist Yoshitaka Fujii, who invented data from nonexistent drug trials and whose retracted papers number an astonishing 183.
The biggest study on this question to date pooled research from seven surveys, finding that 1.97 per cent of scientists admit to faking their data at least once.
of the thirty-two scientists currently on the Retraction Watch Leaderboard, only one is a woman.95 To know whether this tells us something important, we’d need to know what the base rate of men versus women was in each relevant field, and thus whether men were overrepresented.
A study in 2013, focusing on the life sciences, took those baseline differences into account and found that men were indeed overrepresented as the subject of fraud reports from the US Office of Research Integrity.
One thing that stood out was that image duplication was more likely to take place in some countries than in others: India and China were overrepresented in the number of papers with duplicated images, while the US, the UK, Germany, Japan and Australia were underrepresented. The authors proposed that these differences were cultural: the looser rules and softer punishments for scientific misconduct in countries like India and China might be responsible for their producing a higher quantity of potentially fraudulent research.
Science requires transparency, it requires valuing method over results, and it should be ideologically neutral. These are not concepts that flourish under a totalitarian regime.
Biases of the kind Gould ascribed to Morton are not the main focus of this chapter. The biases that’ll chiefly concern us have to do with the process of science itself: biases towards getting clear or exciting results, supporting a pet theory, or defeating a rival’s argument. Any one of these can be enough to provoke unconscious data-massaging, or in some cases, the out-and-out disappearance of unsatisfactory results.
Reading the science pages in the newspaper, one could be forgiven for thinking that scientists are constantly having their predictions verified and their hypotheses supported by their research, while studies that don’t find anything of interest are as rare as hens’ teeth.
but it shows just the same bias towards new and exciting stories. If one looks through the journals, one finds endless positive results (where the scientists’ predictions pan out or something new is found) but very few null results (where researchers come up empty-handed). In just a moment we’ll dive into the technical, statistical definition of ‘positive’ versus ‘null’ results. For now, it’s enough to know that scientists are usually looking for the former and are disappointed to end up with the latter.
numbers are noisy. Every measurement and every sample comes with a degree of random statistical variation, of measurement and sampling error. This is not just hard for a human to fake – it’s also hard to distinguish from the signal for which scientists are looking. The noisiness of numbers constantly throws up random outliers and exceptions, resulting in patterns that might in fact be meaningless and misleading.
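As a small illustration (mine, not the book’s) of how noisy even simple measurements are, here is a Python sketch drawing repeated samples from one population; the mean, spread and sample size are arbitrary choices.

```python
# Sketch: sampling error. Every sample 'measures' the same population,
# yet no two sample means agree, and some land surprisingly far out.
import numpy as np

rng = np.random.default_rng(seed=0)
sample_means = [rng.normal(loc=50, scale=10, size=25).mean() for _ in range(10)]
print([round(m, 1) for m in sample_means])
```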
There are endless types of statistical tests, such as Z-tests, t-tests, chi-squared tests and likelihood ratios; the choice depends, among other things, on the kind of data you’re looking at. Essentially all of them are done these days by feeding your data into computer software. When you run one of these programs, its output will include, alongside many other useful numbers, the relevant p-value.15
The p-value is the probability that your results would look the way they look, or would seem to show an even bigger effect, if the effect you’re interested in weren’t actually present.17 Notably, the p-value doesn’t tell you the probability that your result is true (whatever that might mean), nor how important it is. It just answers the question: ‘in a world where your hypothesis isn’t true, how likely is it that pure noise would give you results like the ones you have, or ones with an even larger effect?’
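To make that definition concrete, here is a minimal Python sketch (not from the book) using scipy’s two-sample t-test on two groups drawn from the very same population, so any apparent difference is pure noise; the group sizes and numbers are arbitrary.

```python
# Sketch: what a p-value does (and doesn't) tell you.
# Both groups are drawn from the SAME population, so the true effect is zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
group_a = rng.normal(loc=100, scale=15, size=30)
group_b = rng.normal(loc=100, scale=15, size=30)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Interpretation: p is the probability of seeing a difference at least this
# large if there were really no effect -- NOT the probability that your
# hypothesis is true, and NOT a measure of how important the effect is.
```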
The size of an effect (representing, for example, how much taller Scottish men are than Scottish women; in our case the effect size is 10 cm) is a different thing from the probability of seeing these results by chance if your hypothesis isn’t true. It’s perfectly possible, for example, that a drug has a very minor impact on an illness, but one that you’re quite sure isn’t a false positive – a small yet statistically significant effect. Back when Fisher was writing, people understood the word ‘significant’ somewhat differently: it implied that the result ‘signified’ that something was happening.
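Here is a hedged sketch of that distinction, using an invented drug-trial-style comparison: with a big enough sample, a tiny effect (expressed here as Cohen’s d) can still be comfortably ‘statistically significant’.

```python
# Sketch: small effect, huge sample -> tiny p-value but tiny effect size.
# The means, spread and sample size are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
n = 50_000
treated = rng.normal(loc=100.5, scale=15, size=n)  # true effect: +0.5 units
control = rng.normal(loc=100.0, scale=15, size=n)

t_stat, p_value = stats.ttest_ind(treated, control)

# Cohen's d: the difference in means expressed in standard-deviation units.
pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = (treated.mean() - control.mean()) / pooled_sd

print(f"p = {p_value:.2e}   (almost certainly far below 0.05)")
print(f"Cohen's d = {cohens_d:.3f}   (a very small effect)")
```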
the 0.05 cut-off for statistical significance encourages researchers to think of results below it as somehow ‘real’, and those above it as hopelessly ‘null’. But 0.05 is as much a convention as the 17-degree taps-aff rule – or, slightly more seriously, the societal decision that someone legally becomes an adult precisely on a particular birthday.
this post hoc rationalisation wouldn’t have occurred to them if the same small-sample study, with its potentially noisy data, happened to show a large effect: they’d have eagerly sent off their positive results to a journal. This double standard, based on the entrenched human tendency towards confirmation bias (interpreting evidence in the way that fits our pre-existing beliefs and desires), is what’s at the root of publication bias.
If the medical literature gives doctors an inflated view of how much benefit a drug provides (as indeed appears to have been the case for antidepressants, which do seem to work, but not with as strong an effect as initially believed), their clinical reasoning will be knocked off track.
In all, it turned out, 41 per cent of the studies that were completed found strong evidence for their hypothesis, 37 per cent had mixed results, and 22 per cent were null.
Of the articles that were published, the percentages for strong, mixed, and null results were 53 per cent, 37 per cent and 9 per cent, respectively. There was, in other words, a 44-percentage-point chasm between the probability of publication for strong results versus null ones.
Rather, this is a sort of unconscious, or semi-conscious, massaging of data – ‘dimly perceived finagling’, to use Gould’s words – into which scientists can fall entirely innocently. Indeed, what’s scary about data manipulation is not only that it leads to false conclusions entering the literature, but that so many scientists either do it completely unwittingly or, if they’re aware that they’re doing it, are oblivious as to why it’s wrong.
They can make these impromptu analysis changes in all sorts of ways: dropping particular datapoints, re-calculating the numbers within specific subgroups (for instance, checking for the effect just in males, then females), trying different kinds of statistical tests, or keeping on collecting new data with no plan to stop until something reaches significance.
convincing others – and perhaps even themselves – that they’d been searching for these results from the start.49 This latter type of p-hacking is known as HARKing, or Hypothesising After the Results are Known. It’s nicely summed up by the oft-repeated analogy of the ‘Texas sharpshooter’, who takes his revolver and randomly riddles the side of a barn with gunshots, then swaggers over to paint a bullseye around the few bullet holes that happen to be near to one another, claiming that’s where he was aiming all along.
A one-in-a-million chance, after all, happens quite a few times if your population includes several million people. Increase the number of opportunities for a chance result to arise, and you can bet that eventually it will do so; cherry-picking specific instances from the vast multitudes is no proof that it wasn’t just an accident.
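The arithmetic is simple to check; the population figure below is just an example, not one from the book.

```python
# Sketch: how often a 'one-in-a-million' event shows up given millions of chances.
p = 1e-6           # probability of the event for any single person
n = 5_000_000      # an example population of five million

expected_hits = p * n                   # on average, 5 occurrences
prob_at_least_one = 1 - (1 - p) ** n    # roughly 0.99

print(f"Expected occurrences: {expected_hits:.0f}")
print(f"Chance of at least one: {prob_at_least_one:.2f}")
```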
even when there’s nothing going on, you’ll still regularly get ‘significant’ p-values, especially if you run lots of statistical tests.52 It’s analogous to the ‘psychics’ who make thousands of predictions about what’ll happen in the next year, then at the end of that year highlight only the ones they got correct, making it look as if they have magical forecasting skills.53 Roll the statistical dice enough times and something will show up as significant, even if it’s just a freak accident in your data.54 Hide away all the times it wasn’t significant and you have the perfect recipe to convince yourself – and everyone else – that you’ve found something real.
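A quick simulation makes the point; everything here – the number of tests, the group sizes, the 0.05 threshold – is an arbitrary choice for illustration.

```python
# Sketch: run many tests on pure noise and count how many come out 'significant'.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
n_tests, alpha = 1000, 0.05

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(size=30)   # both groups from the same distribution:
    b = rng.normal(size=30)   # there is genuinely nothing going on
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

# Expect roughly 5% 'significant' results by chance alone. Report only those
# and you look like the psychic who mentions only the predictions that came true.
print(f"{false_positives} of {n_tests} null tests were 'significant'")
```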
As we discussed in the last chapter when we considered the motives of fraudulent scientists, if you already believe your hypothesis is true before testing it, it can seem eminently reasonable to give any uncertain results a push in the right direction. And whereas the true fraudster knows that they’re acting unethically, everyday researchers who p-hack often don’t.
Meta-science experiments, in which multiple research groups are tasked with analysing the same dataset or designing their own study from scratch to test the same hypothesis, have found a high degree of variation in methods and results.
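As a toy illustration of that variation – not a reconstruction of any of those projects – here is a Python sketch in which two defensible choices about ‘outliers’ give different p-values for the same invented dataset.

```python
# Sketch: the same data, two reasonable analyses, two different p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
treatment = rng.normal(loc=0.4, scale=1.0, size=40)
control = rng.normal(loc=0.0, scale=1.0, size=40)

# Analysis 1: keep every datapoint.
_, p_all = stats.ttest_ind(treatment, control)

# Analysis 2: drop 'outliers' more than 2 standard deviations from each group mean.
def trim(x):
    return x[np.abs(x - x.mean()) < 2 * x.std(ddof=1)]

_, p_trimmed = stats.ttest_ind(trim(treatment), trim(control))

print(f"All data kept:     p = {p_all:.3f}")
print(f"Outliers excluded: p = {p_trimmed:.3f}")
# Neither choice is obviously wrong, yet they can point to different conclusions.
```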
This is the trouble with all kinds of p-hacking, whether explicit or otherwise: they cause the analysis to – using the technical term – overfit the data.73 In other words, the analysis might describe the patterns in that specific dataset well, but those patterns could just be noisy quirks and idiosyncrasies that won’t generalise to other data, or to the real world. This is useless.
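Here is a brief sketch of what overfitting looks like in miniature, using invented pattern-free data: a very flexible model ‘explains’ the noise in the original sample almost perfectly, then does no better than a boring flat line on fresh data from the same source.

```python
# Sketch: overfitting pure noise. The model degrees and sample sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(seed=5)

x = np.linspace(0, 1, 15)
y_train = rng.normal(size=15)   # no real pattern at all
y_fresh = rng.normal(size=15)   # new data from the same pattern-free process

flat_fit = np.polyfit(x, y_train, deg=0)     # just the mean: nothing to overfit
wiggly_fit = np.polyfit(x, y_train, deg=10)  # flexible enough to chase every quirk

def mse(coeffs, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

print(f"Original sample -> flat: {mse(flat_fit, y_train):.2f}, "
      f"wiggly: {mse(wiggly_fit, y_train):.2f}")
print(f"Fresh data      -> flat: {mse(flat_fit, y_fresh):.2f}, "
      f"wiggly: {mse(wiggly_fit, y_fresh):.2f}")
# The wiggly model fits the sample it was tuned to far better, but on fresh
# data its advantage disappears (and typically reverses): the 'patterns' it
# captured were just noise that doesn't generalise.
```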