Bad science evolves

Richard McElreath and I wrote a paper about how incentives to publish can create conditions for the cultural evolution of low-quality research methods. It’s called The Natural Selection of Bad Science (coming soon to an open access journal near you), and it’s already gotten a few write-ups, for which I’m grateful. I mention this because the Society for Personality and Social Psychology (SPSP)’s Character and Context blog asked me to write a post about the paper, which I did. Check it out.

Bad Science Evolves. Stopping It Means Changing Institutional Selection Pressures.

A Theoretical Lens for the Reproducibility Project

Recently, the Open Science Collaboration, a team of over 250 scientists organized by the Center for Open Science, published the results of their Reproducibility Project: Psychology, in which 100 studies from high-profile psychology journals were replicated. The headline result is that almost two-thirds of the replication attempts failed to find “statistically significant” results. By the field’s traditional criteria, this means that most of the published studies failed to replicate. The study has been making waves all over the place, and rightly so. It represents a tremendous amount of work that inarguably improves what we know and how we think about psychological research, and perhaps all scientific research.

Knowing exactly what to make of all this is tricky, however. A number of media outlets cry “Most psychological research is wrong! It’s all bunk!” This is overblown, but it raises the question: what does it all mean? Several excellent scientists have already made valuable contributions to this discussion (notably Michael Frank, Daniel Lakens, Alexander Etz, Lisa Feldman Barrett). Here I add my own.

I am not an experimental psychologist, though I started my graduate career doing psychophysics and then animal behavior. I work primarily as a modeler. Last year, Richard McElreath and I developed a mathematical model of scientific discovery. Our goal was to tackle several questions related to replication, publication bias, and the evidential value of scientific results, given that (a) many (perhaps most) novel hypotheses are false, (b) some false positives are inevitable, and (c) some results are more likely to be published than others (there are, of course, other assumptions, but these are the most relevant ones in the context of this post). It is a happy coincidence that our paper was published the day before the Reproducibility Project paper, all the more so because our model provides a theoretical lens through which to view their results.

Our model focused, in part, on the probability that a hypothesis is true, given a series of positive and negative findings – that is, given some number of successful or unsuccessful replications. I won’t go into detail regarding our model construction or analysis, though I hope that you will read the paper. Rather, I want to share a few thoughts about doing science that came from viewing the Reproducibility Project results through the lens of our mathematical model.

1. We shouldn’t be too surprised that many findings fail to replicate, but we can still do better.

Coming up with testable hypotheses is hard. This point has been made repeatedly over the last decade – if novel hypotheses tend to be wrong, then many results will be false positives, which are (thankfully) likely to fail to replicate. There are two things we can do to improve the situation here.

First, we can try to lower the rate of false positives. Many have suggested pre-registration of hypotheses. On the other hand, exploratory analyses are vital to scientific discovery. A compelling compromise is that researchers should make it crystal clear whether their results came from an exploration of existing data or from a study specifically designed to test an a priori hypothesis, in which case pre-registration is desirable. More epistemological weight should be placed on findings of the latter kind. In general, experimental and statistical methods that decrease false positives are a good thing.

Second, we can try to increase the a priori probability that the hypotheses we test are true. As a theorist, it is perhaps unsurprising that my recommendation is: better theory. Specifically, I think psychology should more fully embrace formal modeling, so that its theories are much more precisely specified. There will be some growing pains, but an added benefit of this will be that empirical findings that fit coherent theories will have a long shelf life. As Gerd Gigerenzer has opined, data without theory are like a baby without a parent: their life expectancy is short.

All that said, we shouldn’t take the results of the Reproducibility Project as a dismissal of psychology as a field with poor theory and lots of false positives (although this may be more true in some subfields than in others). False positives can occur under the best of conditions, as can false negatives. For this reason…

2. We shouldn’t put too much stock in any one result.

Science is an imperfect process. A true hypothesis may fail to yield a positive result, and a false hypothesis may appear true given some experimental data. As such, in most cases results should be interpreted probabilistically – the probability that some hypothesis is true given the data. When replication is common, those data will include the results of multiple studies. This would be a very good thing.

Using our model, we analyzed a pessimistic but perhaps not unrealistic scenario in which only one in a thousand tested hypotheses is true, power is 0.6, and the false positive rate is 0.1. A base rate of one in a thousand may seem overly low, but keep in mind that this includes each and every hypothesis tested in an exploratory data analysis, that is, every possible association between variables. In that light, a low probability that any one of those associations really exists may not seem quite as outlandish. Under these conditions, the vast majority of initial positive findings are expected to be false positives. We found that in order to have greater than 50% confidence that a hypothesis is true, an initial positive finding would need to be successfully replicated three more times. Even if we increase the base rate 100-fold, so that one in ten hypotheses is true, no result that hasn’t been successfully replicated at least once can be trusted with over 50% confidence.
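To make these numbers concrete, here is a minimal sketch of the odds bookkeeping behind them, written in Python. It is a simplification of the model in our paper: each study is reduced to a binary positive or negative outcome, only positive results are tallied, and the power and false-positive rate are the values from the scenario above.

```python
# Minimal sketch (not the full model from the paper): update the probability
# that a hypothesis is true after a string of positive results.

def posterior_after_positives(prior, power, alpha, n_positive):
    """Probability the hypothesis is true after n_positive positive results."""
    odds = prior / (1 - prior)              # prior odds that the hypothesis is true
    odds *= (power / alpha) ** n_positive   # each positive result multiplies the odds by power/alpha
    return odds / (1 + odds)

power, alpha = 0.6, 0.1

# Base rate of one in a thousand: an initial positive finding plus three
# successful replications are needed to pass 50% confidence.
for n in range(1, 5):
    print(n, round(posterior_after_positives(0.001, power, alpha, n), 3))
# -> 1 0.006, 2 0.035, 3 0.178, 4 0.565

# Base rate of one in ten: a single positive result still falls short of 50%.
print(round(posterior_after_positives(0.1, power, alpha, 1), 3))  # -> 0.4
```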

If many replications are needed to establish confidence, then perhaps we shouldn’t cry foul over a single failure to replicate. In some areas of research, most initial results should be viewed with at least some skepticism. This means that the rewards for any novel result, no matter how astonishing, should be moderate. Even more so given the fact that highly surprising results are more likely to be wrong.

3. Replication efforts are valuable even when they are imperfect.

One of the great things about the Reproducibility Project is the extent to which it involved the authors of the original studies being replicated. This is important, because replication efforts have been attacked as a sort of vigilantism, or as the work of dilettantes who lack the expertise or nuance to perform a precise replication. This argument is not without merit. An extreme version holds that failures to replicate are wholly uninformative. This argument is without merit. Our analysis shows quite clearly that the replication efforts are informative even when replications have substantially less power than the initial studies. Power need only be high enough so that true hypotheses are, on average, more likely to yield positive results than negative results. That said, it is a sad truth that this criterion will not always be met.
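Using the same simplified bookkeeping as the sketch above (and, again, not the paper’s full analysis), one can get a feel for how replication power affects how much a single replication shifts belief: a lower-powered replication moves the odds less per study, but a successful one still moves them in the right direction.

```python
# Continuing the sketch above: how much one replication shifts the odds that a
# hypothesis is true, as a function of the replication's statistical power.
alpha = 0.1  # false-positive rate, as in the scenario above

for power in (0.8, 0.6, 0.3):
    lr_success = power / alpha               # odds multiplier after a successful replication
    lr_failure = (1 - power) / (1 - alpha)   # odds multiplier after a failed replication
    print(f"power={power}: success x{lr_success:.1f}, failure x{lr_failure:.2f}")
# Even at power 0.3, a successful replication still triples the odds; it simply
# carries less information per study than a high-powered one.
```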

4. Publishing null results comes with some caveats, but we should almost always publish replication efforts.

Among the forces working against replication efforts is the fact that null results and replications are sometimes difficult to publish. A recent analysis of the “file drawer” effect showed that most null results weren’t published because the authors never bothered to submit them. Our analysis highlights the critical importance of replication in assessing the truth or falsehood of a hypothesis. Several replications may be needed to establish confidence, and that requires that scientists be made aware of efforts to replicate previous findings. All replications should be published. Correspondingly, outlets for publishing those replications are needed, as are incentives to young scientists for authoring them.

On the other hand, it is not clear that publishing absolutely every result is a good thing. If most novel hypotheses are wrong, then most novel results will be correct rejections of those hypotheses. In this case, publishing every result would fill our journals with these true negatives, making it difficult to find the positive results. Even worse would be if substantial replication efforts were devoted to re-confirming the falsehood of hypotheses that had already been correctly rejected. Admittedly, this scenario is unlikely – the allure of positive results is just too strong. Even so, our analysis indicates that calls to publish every result come with caveats. A possible solution is the establishment of a repository for very brief reports indicating the failure of experimental tests to yield positive results. Such a repository would be easily searchable, would avoid clogging up journals, and would require minimal effort on the part of busy scientists with ticking tenure clocks.

Coda

I have kept this discussion qualitative, and have purposely avoided mathematical or statistical details in order to maximize generality and accessibility. There are lots of important points to be made regarding methodology, replication, and publication bias that I have sidestepped. Hopefully it has been useful nevertheless.

Interactional Complexity and Human Societies

We are interested in understanding various aspects of human societies. Since human societies undoubtedly qualify as complex systems, in both their structure and their function, it is useful to discuss some terminology and philosophical concepts related to the organization of complex systems. William Wimsatt’s (1974) notion of interactional complexity will be particularly useful but is not widely appreciated, and so I will go into some detail to clarify this concept.

Decompositions and descriptive complexity

Stuart Kauffman (1971) presciently noted that, when describing a complex system, different descriptions of the system and resulting articulations of parts, or decompositions, might be varyingly useful depending on the purpose of the analysis, and that these descriptions might be non-isomorphic. That is, the delineations of the constituent parts may not coincide between different decompositions.

Wimsatt’s major insight was to note that relationships between the different decompositions of a system could be used to characterize the system’s intrinsic complexity. As an example, he compared a chunk of granite with the fruit fly Drosophila melanogaster (see Fig. 1 in Wimsatt, 1974). The chunk of granite can be described via a decomposition into parts grouped by (for example) chemical composition, thermal conductivity, electrical conductivity, density, or tensile strength. Although these decompositions are not completely isomorphic, some of the boundaries between parts are shared between each description (e.g., a section with a specific density will also have a specific tensile strength and chemical composition relative to the neighboring parts). The fruit fly, meanwhile, can also be described by decomposition into parts based on (for example) anatomical organs, cell types, developmental gradients, biochemical reactions (i.e., the local presence of reaction types), or physiological systems (as described by cybernetic flow diagrams). In contrast to the granite chunk, the boundaries between the parts of the various decompositions are not spatially coincident, and indeed, the last two items on the list are not even clearly describable in a coherent spatial manner. Wimsatt introduced the term descriptive complexity to indicate the degree to which the spatial boundaries of various descriptive decompositions coincide. A fruit fly is thus more descriptively complex than a chunk of granite.

Interactional complexity

A system can often be described in terms of subsystems, each of which has a specific set of parts. We can constrain this description by specifying that, for the parts within these subsystems, the causal relations with other parts within the subsystem should be much stronger than the causal relations with parts from other subsystems. Indeed, this constraint helps delineate each subsystem from the others; the degree to which it holds is the degree to which the system’s behavior can be validly predicted by considering each subsystem on its own, ignoring interactions between them. Remember, however, that there may be many useful decompositions of the system into subsystems, each with its own set of constituent parts.

We say that a system under these constraints is interactionally simple if there are only weak causal relationships between the parts of a subsystem in one decomposition and the parts of a different subsystem in a different decompositional description, and interactionally complex to the degree to which those causal relations are strong. Put more bluntly, a system has a high degree of interactional complexity if an investigator must consider the system from more than one theoretical perspective (i.e., more than one decomposition) in order to make useful predictions. Driving the point home, Wimsatt writes, “If the system is descriptively complex and is also interactionally complex for more than a very small number of interactions, the investigator is forced to analyze the relations of parts for virtually all parts in the different decompositions, and probably even to construct connections between the different perspectives at the theoretical level.” (1974, p. 74). Forty years after Wimsatt’s paper first appeared, this idea may no longer be revelatory, but I maintain that it is still underappreciated.

Human societies are interactionally complex

It seems obvious that human societies are descriptively complex. We can describe societies at the level of individuals, in terms of nuclear families, kin groups, subcultures, and social classes. We can also include infrastructure and transportation, livestock and farming, religious rituals and linguistic traditions. This is all on top of the descriptive complexity of an individual human, which I believe we can agree is at least as great as that of a fruit fly.

Importantly, human societies, and the human groups that comprise societies, are also interactionally complex. Perspectives include genetic, neurological, cognitive, familial, cultural, and ecological. To at least some extent, we can’t ignore any of them.

References

  • Kauffman, S. A. (1971). Articulation of parts explanation in biology and the rational search for them. In: PSA 1970, ed. R. C. Buck & R. S. Cohen, pp. 257–272. Philosophy of Science Association.
  • Wimsatt, W. C. (1974). Complexity and organization. In: PSA 1972, ed. K. Schaffner & R. S. Cohen, pp. 67–86. Philosophy of Science Association.