The Answer, it turned out, was...42.
Of course, the problem was that no one quite knew what the question was.
Recently, in our own universe, a commotion broke out amidst a group of highly intelligent beings over a different number. According to the Reproducibility Project, 36% of a set of attempted replication studies produced significant results.
36, it turned out, was the Answer.
It’s just not quite clear what the question was.
Think of it this way. Let’s say I told you that a new restaurant opened down the street, and it has a three-star rating. Or that I learned I scored a 17 on happiness.
Three stars out of what, you would say, like a good scientist. 17 of how many? Out of 18? 25? 100?
We have the same problem with the number 36. We know the top of the ratio, but not the bottom. The answer, but not the question.
The question, for example, is not “How many studies out of 100 will replicate?” because only 97 of the original 100 studies reported a significant effect. “Fine,” you might think, rolling your eyes ever so slightly, “it’s technically how many out of 97.” But wait, that can’t be right either, because it assumes the replicating studies have 100% power. If we were to run 100 studies at 80% power, we would expect about 80 to reach significance.
“Aha,” you say, sure that you’ve got me. To determine the bottom of the ratio, we just need to multiply 97 by the power of the replication studies. For instance, if the typical power of the replication studies was 80%, we’d expect about 78 exact replications to hit significance, whereas if the typical power was 40%, we’d expect fewer than 39 exact replications to hit significance.
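If you’d like to play along at home, that back-of-the-envelope math fits in a few lines of Python (the power values below are purely illustrative assumptions, not estimates of anything):

```python
# Back-of-the-envelope: expected number of significant exact replications
# = (number of significant original studies) x (typical power of the replications).
k_significant_originals = 97

for assumed_power in (1.0, 0.8, 0.4):  # illustrative values, not estimates
    expected_hits = k_significant_originals * assumed_power
    print(f"replication power {assumed_power:.0%}: expect ~{expected_hits:.1f} significant results")
```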
So: What was the typical power of the studies in the Reproducibility Project?
Well, based on the observed effect sizes in the original 97 studies, the authors of the Science article used classic power analyses to estimate that their average power was 92%, so that about 89 exact replications should have hit significance. But of course, we know that publication bias can lead classic power analyses to woefully underestimate the number of participants needed to achieve adequate power, especially for small-to-medium-sized effects and especially when the original study’s N isn’t very big (see Perugini et al., 2014, PPS, for an excellent discussion of this issue). For instance, to replicate an original study testing d = .5 with N = 60, Perugini and colleagues recommend an N that’s over three times as large as the N suggested by a classic power analysis (N = 324 instead of N = 102).
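To make that mechanism concrete, here’s a toy simulation (every number in it is invented; none of this is the Reproducibility Project data). It “publishes” only the significant studies from a batch of small, underpowered originals, and then asks what a classic power analysis based on those published effect sizes would claim about an exact, same-sized replication:

```python
# Toy simulation of publication bias (all numbers invented for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d, n_per_group, n_studies = 0.3, 30, 20_000

# Simulate many small two-group "original" studies and keep only the significant ones.
g1 = rng.normal(0.0, 1.0, size=(n_studies, n_per_group))
g2 = rng.normal(true_d, 1.0, size=(n_studies, n_per_group))
p = stats.ttest_ind(g2, g1, axis=1).pvalue
observed_d = g2.mean(axis=1) - g1.mean(axis=1)   # SDs are 1, so the mean difference ~ d
published = p < .05

def power_two_sample(d, n):
    """Approximate power of a two-sided, alpha = .05, two-sample t-test (n per group)."""
    ncp = d * np.sqrt(n / 2)
    crit = stats.t.ppf(0.975, df=2 * n - 2)
    return 1 - stats.nct.cdf(crit, df=2 * n - 2, nc=ncp)

d_published = observed_d[published].mean()
print(f"true d = {true_d}, average published d = {d_published:.2f}")
print(f"power a classic analysis would claim for an exact replication: "
      f"{power_two_sample(d_published, n_per_group):.0%}")
print(f"power the exact replication actually has: "
      f"{power_two_sample(true_d, n_per_group):.0%}")
```

The published effect sizes come out inflated, so the classic analysis promises far more power than an exact replication actually has.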
Okay, so we know that 89 is likely to be far too large as an estimate for the bottom of our ratio. But the number shrinks even further when we consider the need to adjust for effect size heterogeneity, which also causes classic power analyses to yield overoptimistic estimates. As effect size heterogeneity increases, the required sample size to attain adequate power increases sharply; a study that you thought was powered at 80%, for instance, might be powered at only 60% when the effect size is small and heterogeneity is moderate (see McShane & Böckenholt, 2014, PPS).
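Here’s the same kind of toy sketch for heterogeneity (again, every number is invented): compare the power you’d compute at a single “average” effect size with the average power you’d actually get if each study’s true effect wobbles around that average.

```python
# Toy illustration of effect size heterogeneity (all numbers invented).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mean_d, tau, n, n_studies = 0.5, 0.2, 64, 20_000   # n per group chosen for ~80% power at d = .5

def simulated_power(true_ds):
    """Share of simulated two-group studies (n per group) that reach p < .05."""
    g1 = rng.normal(0.0, 1.0, size=(n_studies, n))
    g2 = rng.normal(np.reshape(true_ds, (-1, 1)), 1.0, size=(n_studies, n))
    return np.mean(stats.ttest_ind(g2, g1, axis=1).pvalue < .05)

# Classic (fixed-effect) power analysis: every study tests exactly d = mean_d.
print(f"power at a fixed d = {mean_d}: {simulated_power(np.full(n_studies, mean_d)):.0%}")

# Heterogeneity: each study's true effect is drawn from N(mean_d, tau).
print(f"average power with heterogeneity (tau = {tau}): "
      f"{simulated_power(rng.normal(mean_d, tau, size=n_studies)):.0%}")
```

The nominal figure drops, and how far it drops depends entirely on numbers we don’t know.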
How much further does the bottom of our ratio shrink when we account for publication bias and effect size heterogeneity? It depends on the true effect sizes tested in the original studies, the extent to which publication bias affected those particular studies, and the degree of effect size heterogeneity—all numbers that we do not know and that we are arguably ill-equipped to estimate. Could it be just a little shrinkage? Sure. Maybe the bottom of the ratio drops to about 70. That means out of an expected 70 significant results, we saw just 36, providing us with a dishearteningly low 51% success rate.
But could it be a lot? Yes, especially if the original studies were relatively small, subject to strong publication bias, and characterized by heterogeneous effect sizes (all quite plausible characterizations in this case). Maybe it drops to about 40. That means out of an expected 40 significant results, we saw 36, providing us with a remarkably high 90% success rate.
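Spelling out that denominator game in code makes the point plain (the denominators are, of course, my made-up numbers from above):

```python
# Same 36 significant replications, very different "success rates,"
# depending on the (unknown) expected number of significant results.
observed_hits = 36
for expected_hits in (89, 70, 40):   # illustrative denominators from the paragraphs above
    print(f"expected {expected_hits} hits: success rate = {observed_hits / expected_hits:.0%}")
```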
Of course, now I’m just pulling numbers out of thin air. But this is precisely my point. If you find yourself balking at one or the other of these estimates—51% seems way too low, or 90% seems way too high—that’s based on your intuition, not on evidence. We don’t know—and in fact cannot know—what the bottom of that ratio is. We don’t know the question that we’re asking. There are many things we can learn from the Reproducibility Project. But the meaning of 36?
Not that.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Bonus nuance:
It is well worth noting that the 270 authors of the (excellent and very nuanced) paper reporting the Reproducibility Project emphasize the distinctly un-magical nature of the number 36, and take considerable care to unpack what we can and cannot learn from the various results reported in the article (H/T Daniel Lakens for highlighting this important point in a recent FB discussion). So, you know, a shorter version of this post might have been "Oops, we forgot the nuance! See original article." But I wanted an excuse to cite Douglas Adams.