A Theoretical Lens for the Reproducibility Project

Recently, the Open Science Collaboration, a team of over 250 scientists organized by the Center for Open Science, published the results of their Reproducibility Project: Psychology, in which 100 highly visible social psychology studies were replicated. The headline result is that almost two-thirds of the studies failed to find “statistically significant” results. By the standards of the field’s traditional criteria, this means that most of the published studies failed to replicate. The study has been making waves all over the place, and rightly so. This paper represents a tremendous amount of work that inarguably improves what we know and how we think about psychological research, perhaps all scientific research.

Knowing exactly what to make of all this is tricky, however. A number of media outlets cry “Most psychological research is wrong! It’s all bunk!” This is overblown, but it raises the question: what does it all mean? Several excellent scientists have already made valuable contributions to this discussion (notably Michael Frank, Daniel Lakens, Alexander Etz, Lisa Feldman Barrett). Here I add my own.

I am not an experimental psychologist – even though I started my graduate school career doing psychophysics, then animal behavior. I work primarily as a modeler. Last year, Richard McElreath and I developed a mathematical model of scientific discovery. Our goal was to tackle several questions related to replication, publication bias, and the evidential value of scientific results, given that (a) many (perhaps most) novel hypotheses are false, (b) some false positives are inevitable, and (c) some results are more likely to be published than others (there are, of course, other assumptions, but these are the most relevant ones in the context of this post). It is a happy coincidence that our paper was published the day before the Reproducibility Project paper. More so because our model provides a theoretical lens through which to view their results.

Our model focused, in part, on the probability that a hypothesis is true, given a series of positive and negative findings – that is, given some number of successful or unsuccessful replications. I won’t go into detail regarding our model construction or analysis, though I hope that you will read the paper. Rather, I want to share a few thoughts about doing science that came from viewing the Reproducibility Project results through the lens of our mathematical model.

1. We shouldn’t be too surprised that many findings fail to replicate, but we can still do better.

Coming up with testable hypotheses is hard. This point has been made repeatedly over the last decade – if novel hypotheses tend to be wrong, then many results will be false positives, which are (thankfully) likely to fail to replicate. There are two things we can do to improve the situation here.

First, we can try to lower the rate of false positives. Many have suggested pre-registration of hypotheses. On the other hand, exploratory analyses are vital to scientific discovery. A compelling compromise is that researchers should make it crystal clear whether their results followed from an exploration of existing data or came from a test specifically designed to test their a priori hypothesis, in which case pre-registration is desirable. More epistemological weight should be placed on findings of the latter kind. In general, experimental and statistical methods that decrease false positives are a good thing.

Second, we can try to increase the a priori probability that the hypotheses we test are true. As a theorist, it is perhaps unsurprising that my recommendation is: better theory. Specifically, I think psychology should more fully embrace formal modeling, so that its theories are much more precisely specified. There will be some growing pains, but an added benefit of this will be that empirical findings that fit coherent theories will have a long shelf life. As Gerd Gigerenzer has opined, data without theory are like a baby without a parent: their life expectancy is short.

All that said, we shouldn’t take the results of the Reproducibility Project as a dismissal of psychology as a field with poor theory and lots of false positives (although this may be more true in some subfields than in others). False positives can occur under the best of conditions, as can false negatives. For this reason…

2. We shouldn’t put too much stock in any one result.

Science is an imperfect process. A true hypothesis may fail to yield a positive result, and a false hypothesis may appear true given some experimental data. As such, in most cases results should be interpreted probabilistically – the probability that some hypothesis is true given the data. When replication is common, those data will include the results of multiple studies. This would be a very good thing.

Using our model, we analyzed a pessimistic but perhaps not unrealistic scenario in which only one in a thousand tested hypotheses were true, power was 0.6, and the false positive rate was 0.1. A base rate of one in a thousand may seem overly low, but keep in might that this includes each and every hypothesis tested in an exploratory data analysis, that is, every possible association between variables. In that light, a low probability that any one of those associations will really exist may not seem quite as outlandish. Under these conditions, the vast majority of initial positive findings are expected to be false positives. We found that in order to have a greater than 50% confidence that the hypothesis is true, it would need to be successfully replicated three more times. Even if we increase the base rate 100-fold, so that one in ten hypotheses are true, no result that hasn’t been successfully replicated at least once can be trusted with over 50% confidence.

If many replications are needed to establish confidence, then perhaps we shouldn’t cry foul over a single failure to replicate. In some areas of research, most initial results should be viewed with at least some skepticism. This means that the rewards for any novel result, no matter how astonishing, should be moderate. Even more so given the fact that highly surprising results are more likely to be wrong.

3. Replication efforts are valuable even when they are imperfect.

One of the great things about the Reproducibility Project is the extent to which it involved the authors of the original studies being replicated. This is important, because replication efforts have been attacked as a sort of vigilantism, or as the work of dilettantes who lack the expertise or nuance to perform a precise replication. This argument is not without merit. An extreme version holds that failures to replicate are wholly uninformative. This argument is without merit. Our analysis shows quite clearly that the replication efforts are informative even when replications have substantially less power than the initial studies. Power need only be high enough so that true hypotheses are, on average, more likely to yield positive results than negative results. That said, it is a sad truth that this criterion will not always be met.

4. Publishing null results comes with some caveats, but we should almost always publish replication efforts.

Among the forces working against replication efforts is the fact that null results and replications are sometimes difficult to publish. A recent analysis of the “file drawer” effect showed that most null results weren’t published because the authors never bothered to submit them. Our analysis highlights the critical importance of replication in assessing the truth or falsehood of a hypothesis. Several replications may be needed to establish confidence, and that requires that scientists be made aware of efforts to replicate previous findings. All replications should be published. Correspondingly, outlets for publishing those replications are needed, as are incentives to young scientists for authoring them.

On the other hand, it is not clear that publishing absolutely every result is a good thing. If most novel hypotheses are wrong, then most novel results will be correct rejections of those hypotheses. In this case, publishing every result would fill our journals with these true negatives, making it difficult to find the positive results. Even worse would be if substantial replication efforts were devoted to confirming the falsehood of those negative results. Admittedly, this scenario is unlikely – the allure of positive results is just too strong. Even so, our analysis indicates that calls to publish every result come with caveats. A possible solution is the establishment of a repository for very brief reports indicating the failure of experimental tests to yield positive results. Such a repository would be easily searchable, avoid clogging up journals, and require minimal effort on the part of busy scientists with ticking tenure clocks.

Coda

I have kept this discussion qualitative, and have purposely avoided mathematical or statistical details in order to maximize generality and accessibility. There are lots of important points to be made regarding methodology, replication, and publication bias that I have sidestepped. Hopefully it has been useful nevertheless.

Paul E. Smaldino

Paradigmatically promiscuous scientist