Don’t call it a comeback

The Reproducibility Project, the giant study to re-run experiments reported in three top psychology journals, has just published its results, and it's either a disaster, a triumph or both for psychology.


You can’t do better than the coverage in The Atlantic, not least as it’s written by Ed Yong, the science journalist who has been key in reporting on, and occasionally appearing in, psychology’s great replication debates.


Two important things have come out of the Reproducibility Project. The first is that psychologist, project leader and now experienced cat-herder Brian Nosek deserves some sort of medal, and his 270-odd collaborators should be given shoulder massages by grateful colleagues.


It’s been psychology’s equivalent of the Large Hadron Collider, but without the need to dig up half of Switzerland.


The second is that no-one quite knows what it means for psychology. 36% of the replications had statistically significant results, and 47% had effect sizes in a comparable range to the originals, although the replication effect sizes were typically about 50% smaller.


When looking at replication by subject area, studies on cognitive psychology were more likely to reproduce than studies from social psychology.


Is this good? Is this bad? What would be a reasonable number to expect? No one’s really sure, because there are perfectly acceptable reasons why more positive results would be published in top journals but not replicate as well, alongside lots of not so acceptable reasons.


The not-so-acceptable reasons have been well-publicised: p-hacking, publication bias and, at the darker end of the spectrum, fraud.


But on the flip side, effects like regression to the mean and ‘surprisingness’ are just part of the normal routine of science.


‘Regression to the mean’ is an effect where, if the first measurement of an effect is large, subsequent measurements or replications are likely to be closer to the average, simply because an unusually extreme first result has probably been inflated by chance. This is not a psychological effect; it happens everywhere.


Imagine you record a high level of cosmic rays from an area of space during an experiment and you publish the results. These results are more likely to merit your attention and the attention of journals because they are surprising.


But subsequent experiments, even if they back up the general effect of high readings, are less likely to find such extreme recordings, because, by definition, it was the statistically surprising nature of the original readings that got them published in the first place.
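To make that selection effect concrete, here is a minimal simulation sketch (my illustration, not part of the original study or this post): it assumes a small true effect, "publishes" only the original studies that happen to reach statistical significance, and then runs a straight replication of each. All the numbers (effect size, sample size, number of studies) are made up for illustration.

```python
# Minimal sketch (illustrative only, made-up numbers): selecting only
# 'surprising' significant results inflates published effect sizes,
# so honest replications of the same true effect look smaller.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_effect = 0.2   # assumed true standardised effect size (Cohen's d)
n = 30              # assumed participants per group
n_studies = 10_000  # number of simulated original studies

published_d, replication_d = [], []
for _ in range(n_studies):
    # Original study: two groups with a small true difference.
    a = rng.normal(true_effect, 1, n)
    b = rng.normal(0, 1, n)
    t, p = stats.ttest_ind(a, b)
    d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    if p < 0.05 and d > 0:  # the 'journal' only publishes significant results
        published_d.append(d)
        # Direct replication: same true effect, same sample size, no selection.
        a2 = rng.normal(true_effect, 1, n)
        b2 = rng.normal(0, 1, n)
        d2 = (a2.mean() - b2.mean()) / np.sqrt((a2.var(ddof=1) + b2.var(ddof=1)) / 2)
        replication_d.append(d2)

print(f"True effect:                  {true_effect:.2f}")
print(f"Mean published effect size:   {np.mean(published_d):.2f}")
print(f"Mean replication effect size: {np.mean(replication_d):.2f}")
```

With these assumed numbers the "published" studies report effect sizes two to three times the true value, while the replications cluster around the true effect, which is broadly the shrinking pattern the Reproducibility Project observed.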


The same may well be happening here. Top psychology journals currently specialise in surprising findings. The editors have shaped these journals by making a trade-off between surprisingness and stability of the findings, and currently the balance is tipped far towards surprisingness. Probably unhealthily so.


This is exactly what the Reproducibility Project found. More initially surprising results were less likely to replicate.


But it’s an open question as to what’s the “right balance” of surprisingness to reliability for any particular journal or, indeed, field.


There’s also a question about reliability versus boundedness. Just because you don’t replicate the results of a particular experiment, it doesn’t necessarily mean the originally reported effect was a false positive. It may mean the effect is sensitive to a particular context that isn’t clear yet. Working this out is basically the grunt work of science.


Some news outlets have wrongly reported that this study shows that ‘about two thirds of studies in psychology are not reliable’ but the Reproducibility Project didn’t sample widely enough across publications to be able to say this.


Similarly, it only looked at initially positive findings. You could easily imagine a ‘Reverse Reproducibility Project’ where a whole load of original studies that found no effect are replicated to see which subsequently do show an effect.


We know publication bias tends to favour positive results, but that doesn’t mean that all negative findings should be automatically accepted as the final answer either.


The main take-home messages are that findings published in leading journals are not a good guide to invariant aspects of human nature. And stop with the journal worship. And let’s get more pre-registration on the go. Plus, science is hard.


What is also clear, however, is that the folks from the Reproducibility Project deserve our thanks. And if you find one who still needs that shoulder massage, limber up your hands and make a start.



Link to full text of scientific paper in Science.

Link to coverage in The Atlantic.

