Monday, February 22, 2016

36 is the new 42

Once upon a time, according to Douglas Adams in The Hitchhiker’s Guide to the Galaxy, a group of hyperintelligent, pandimensional beings took it upon themselves to build an especially fancy supercomputer, which was charged with the task of deducing the Answer to Life, the Universe, and Everything. After seven and a half million years of intensive processing, the supercomputer announced the Answer to the breathlessly waiting crowd.

The Answer, it turned out, was...42.

Of course, the problem was that no one quite knew what the question was.

Recently, in our own universe, a commotion broke out amidst a group of highly intelligent beings over a different number. According to the Reproducibility Project, 36% of a set of attempted replication studies produced significant results.

36, it turned out, was the Answer.

It’s just not quite clear what the question was.

Think of it this way. Let’s say I told you that a new restaurant opened down the street, and it has a three-star rating. Or that I learned I scored a 17 on happiness.

Three stars out of what, you would say, like a good scientist. 17 out of how many? 18? 25? 100?

We have the same problem with the number 36. We know the top of the ratio, but not the bottom. The answer, but not the question.

The question, for example, is not “How many studies out of 100 will replicate?” because only 97 of the original 100 studies reported a significant effect. “Fine,” you might think, rolling your eyes ever so slightly, “it’s technically how many out of 97.” But wait, that can’t be right either, because it assumes that each replication study has 100% power. If we were to run 100 studies of true effects at 80% power, we would expect only about 80 to reach significance.

“Aha,” you say, sure that you’ve got me. To determine the bottom of the ratio, we just need to multiply 97 by the power of the replication studies. For instance, if the typical power of the replication studies was 80%, we’d expect about 78 exact replications to hit significance, whereas if the typical power was 40%, we’d expect fewer than 39 exact replications to hit significance.
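If it helps to see that arithmetic spelled out, here it is as a minimal sketch in Python; the power values are assumptions plugged in purely for illustration, not estimates of anything:

    # Expected number of significant results among 97 exact replications
    # of true effects, for a few assumed levels of typical power
    # (the power values are purely illustrative).
    n_original_significant = 97

    for typical_power in (1.00, 0.80, 0.40):
        expected_hits = n_original_significant * typical_power
        print(f"power = {typical_power:.2f} -> expect ~{expected_hits:.0f} significant replications")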

So: What was the typical power of the studies in the Reproducibility Project?

Well, based on the observed effect sizes in the original 97 studies, the authors of the Science article used classic power analyses to estimate that the average power of the replication studies was 92%, so that about 89 exact replications should have hit significance. But of course, we know that publication bias can lead classic power analyses to woefully underestimate the number of participants needed to achieve adequate power, especially for small-to-medium-sized effects and especially when the original study N isn’t very big (see Perugini et al., 2014, PPS, for an excellent discussion of this issue). For instance, to replicate an original study testing d = .5 with N = 60, Perugini and colleagues recommend an N that’s over three times as large as the N suggested by a classic power analysis (N = 324 instead of N = 102).
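To make the publication-bias worry concrete, here is a toy sketch (this is not Perugini et al.’s safeguard calculation, and every number in it is a made-up assumption): plan a replication around the effect size the original study reported, then ask how much power that plan really buys if the true effect is smaller.

    # Classic power analysis based on the observed (possibly inflated) effect
    # size, versus the power that plan actually has against a smaller true
    # effect. The effect sizes below are illustrative assumptions.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    d_observed = 0.5   # effect size reported by the original study
    d_true = 0.3       # hypothetical smaller true effect after publication bias

    # Classic plan: per-group sample size for 80% power at the observed d
    n_per_group = analysis.solve_power(effect_size=d_observed, power=0.80, alpha=0.05)
    print(f"planned n per group: {n_per_group:.0f}")

    # Power of that same plan if the true effect is d_true
    actual_power = analysis.power(effect_size=d_true, nobs1=n_per_group, alpha=0.05)
    print(f"power against the smaller true effect: {actual_power:.2f}")

With these made-up numbers, the “adequately powered” plan ends up with well under 50% power against the smaller true effect, which is the basic reason classic power analyses can mislead us here.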


Okay, so we know that 89 is likely to be far too large as an estimate for the bottom of our ratio. But the number shrinks even further when we consider the need to adjust for effect size heterogeneity, which also causes classic power analyses to yield overoptimistic estimates. As effect size heterogeneity increases, the required sample size to attain adequate power increases sharply; a study that you thought was powered at 80%, for instance, might be powered at only 60% when the effect size is small and heterogeneity is moderate (see McShane & Bockenholt, 2014 PPS).
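As a rough illustration of that heterogeneity point (again with made-up numbers, not anything estimated from the RP:P), here is a small simulation: a design that gives about 80% power against a fixed d = .5 gives less when the true effect varies from study to study around that same average.

    # Toy simulation of how effect size heterogeneity eats into power.
    # All numbers are illustrative assumptions.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_per_group, n_sims = 64, 5000   # n chosen for ~80% power at a fixed d = 0.5
    d_mean = 0.5                     # average true effect across studies

    def simulated_power(tau):
        """Share of simulated studies reaching p < .05 when each study's
        true effect is drawn from Normal(d_mean, tau)."""
        hits = 0
        for _ in range(n_sims):
            d = rng.normal(d_mean, tau) if tau > 0 else d_mean
            treatment = rng.normal(d, 1.0, n_per_group)
            control = rng.normal(0.0, 1.0, n_per_group)
            if ttest_ind(treatment, control).pvalue < 0.05:
                hits += 1
        return hits / n_sims

    print(f"power with a fixed effect:        {simulated_power(0.0):.2f}")
    print(f"power with heterogeneous effects: {simulated_power(0.3):.2f}")

With these assumptions the fixed-effect version comes out near 80% and the heterogeneous version comes out noticeably lower; the exact values are beside the point, the direction is what matters.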

How much further does the bottom of our ratio shrink when we account for publication bias and effect size heterogeneity? It depends on the true effect sizes tested in the original studies, the extent to which publication bias affected those particular studies, and the degree of effect size heterogeneity—all numbers that we do not know and that we are arguably ill equipped to estimate. Could it be just a little shrinkage? Sure. Maybe the bottom of the ratio drops to about 70. That means out of an expected 70 significant results, we saw just 36, providing us with a dishearteningly low 51% success rate.

But could it be a lot? Yes, especially if the original studies were relatively small, subject to strong publication bias, and characterized by heterogeneous effect sizes (all quite plausible characterizations in this case). Maybe it drops to about 40. That means out of an expected 40 significant results, we saw 36, providing us with a remarkably high 90% success rate.
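To put the whole problem in one place, here is the same 36 read against a few hypothetical denominators; every denominator below is an assumption, which is exactly the trouble:

    # The observed count of significant replications, divided by several
    # assumed denominators (expected numbers of significant results).
    observed_significant = 36

    for expected in (89, 70, 50, 40):
        rate = observed_significant / expected
        print(f"expected {expected} significant -> apparent success rate {rate:.0%}")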

Of course, now I’m just pulling numbers out of thin air. But this is precisely my point. If you find yourself balking at one or the other of these estimates—51% seems way too low, or 90% seems way too high—that’s based on your intuition, not on evidence. We don’t know—and in fact cannot know—what the bottom of that ratio is. We don’t know the question that we’re asking. There are many things we can learn from the Reproducibility Project. But the meaning of 36?

Not that.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Bonus nuance:
It is well worth noting that the 270 authors of the (excellent and very nuanced) paper reporting the Reproducibility Project emphasize the distinctly un-magical nature of the number 36, and take considerable care to unpack what we can and cannot learn from the various results reported in the article (H/T Daniel Lakens for highlighting this important point in a recent FB discussion). So, you know, a shorter version of this post might have been "Oops, we forgot the nuance! See original article." But I wanted an excuse to cite Douglas Adams.

5 comments:

  1. I found this a very frustrating read, because it doesn't explore the practical meaning of each of the "denominator values" you suggest.

    If you think that when you read a random article in a social psychology journal, you should be able to take its results as a pretty good statement of "what would happen if I tried this at home," then your denominator should be 100. This is the typical standard that we hold science to: what was demonstrated is something that, in general, I can reliably expect to happen given the same set of circumstances.

    If, on the other hand, you already take it as a given that psychology "cuts corners" and doesn't meet the typically rigorous standards of "capital S" science, you can build in more and more generous correction factors. Well, the official recommendation from Jacob Cohen is 80% power, which (combined with publication bias) means that we end up with inflated effect sizes and slightly lower power to detect the same effect again than we otherwise would. So I'm going to knock down my confidence in any given psychology result a bit.

    Wait, but what about between-study variability? What about the empirical fact that people often don't even get to 80% power? Let me handicap my expectations further.

    These handicaps don't change the fact that only 36% of the original results came out the same way when we repeated them. They are ways of re-calibrating expectations according to how low we're willing to set our standards as a field. They may be useful as retrospective explanations--"aha, there are very legitimate reasons why we got such a low number"--but they don't invalidate the importance of 36%. Do you want to read the literature of a field where you know that only 36% of results would come out the same way if you repeated the experiment? Would you want to teach those results to students?

    36% should be a wake-up call. It is most definitely not some random number generated by a computer constructed by pan-dimensional beings.

    Replies
    1. Thanks for sharing your thoughts, Alex! I think it would be useful to take a closer look at some of these assumptions.

      First, I think we need to ask whether we should really define “capital S” science as “what was demonstrated is something that, in general, I can reliably expect to happen given the same set of circumstances.” Is a scientific finding one that happens 100% of the time?

      If a medical study found that taking a new drug reduces the length of a cold by 50% on average, compared to a placebo group, does that mean that when I take the new drug my cold will definitely be gone in half the time? Well, no, because 50% shorter was the average effect across the participants in the study sample. The size (and even direction) of the effect may have differed across the study participants, I may be different from the study participants, and my cold may be different from the cold experienced by the participants in this study.

      What about a study in which researchers inject genetically identical mice with Substance A versus Substance B, and find that the mice in the Substance A condition show a higher number of T cells in their footpads? Well, if there is any unreliability (i.e., inconsistency) at all in the manipulation and measure…if sometimes the injection leads to a tiny bit more or less Substance A, say, depending on how the mouse’s head is tilted when you administer the injection…then again, the results of this experiment will fluctuate from one iteration to the next.

      Taking a more mathematical approach, we can ask what we should expect to find when we replicate a scientific experiment in the presence of sampling error (the first example above) and/or measurement error (the second example above). This open access article by Stanley & Spence provides a great discussion of these issues.
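      (If it helps to see the sampling-error piece in action, here is a tiny simulation sketch; the effect size and sample size are made up, and it only shows how much exact replications of a perfectly real effect bounce around.)

        # Exact replications of a real effect still scatter widely from
        # sampling error alone. All numbers are illustrative assumptions.
        import numpy as np
        from scipy.stats import ttest_ind

        rng = np.random.default_rng(1)
        d_true, n_per_group, n_reps = 0.4, 50, 1000

        observed_diffs, significant = [], 0
        for _ in range(n_reps):
            treatment = rng.normal(d_true, 1.0, n_per_group)
            control = rng.normal(0.0, 1.0, n_per_group)
            observed_diffs.append(treatment.mean() - control.mean())
            if ttest_ind(treatment, control).pvalue < 0.05:
                significant += 1

        print(f"observed mean differences range from {min(observed_diffs):.2f} to {max(observed_diffs):.2f}")
        print(f"significant in {significant / n_reps:.0%} of exact replications")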

      I think the question you ask about what is the ideal replicability rate for a given literature is also very interesting. I might start with the slightly simpler question of what is the ideal false positive rate for a given scientific domain? Do we want it to be zero—all statistically significant effects are definitely true? That certainly sounds like a good idea on the surface. But now let’s consider a potential downside to this approach…as we get more and more conservative about our false positive rate, we also decrease our ability to discover (or the efficiency with which we can discover) true effects. Most often, scientists talk about this as a Type I/Type II error tradeoff, but we can also think of this idea in terms of scientists having a finite pool of resources (participants, money, time, personnel, etc.) and having to decide how to apportion those resources (see e.g., Lakens, 2014, EJSP). If we wanted to prioritize the rate of true scientific discoveries, then, or the efficiency with which we can discover true things about the world, we actually wouldn’t want a false positive rate of zero. What rate would we want? That’s an excellent question that deserves some careful consideration.
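      (And here is a cartoon of that resource-allocation point, with entirely made-up numbers: with a fixed pool of participants, making the false positive rate stricter means larger studies, fewer studies, and therefore fewer true discoveries.)

        # With a fixed participant pool, a stricter alpha means bigger studies,
        # so fewer studies, so fewer expected true discoveries.
        # All numbers are illustrative assumptions.
        import math
        from statsmodels.stats.power import TTestIndPower

        analysis = TTestIndPower()
        participant_pool = 10000   # total participants available
        d_true = 0.5               # assumed effect size when an effect is real
        p_effect_is_real = 0.5     # assumed share of tested hypotheses that are true

        for alpha in (0.05, 0.005, 0.0005):
            n_per_group = math.ceil(analysis.solve_power(effect_size=d_true, power=0.80, alpha=alpha))
            n_studies = participant_pool // (2 * n_per_group)
            true_discoveries = n_studies * p_effect_is_real * 0.80   # 80% power by design
            false_positives = n_studies * (1 - p_effect_is_real) * alpha
            print(f"alpha={alpha}: {n_studies} studies, "
                  f"~{true_discoveries:.0f} true discoveries, ~{false_positives:.1f} false positives")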

    2. Stanley & Spence, 2014: http://pps.sagepub.com/content/9/3/305.abstract

  2. Alison, doesn't what you are saying amount to: if we assume (as we probably should) that the original studies replicated in the RPP have highly inflated effect sizes, then we should have never expected a high number of them to replicate in the first place? This is fair as far as it goes, but isn't that precisely what people who read the RPP pessimistically are reacting to with consternation? I suppose this is just another version of Alex's comment. I am maybe a little more neutral than he is about what the "right" rate of false positives in the literature is, but I don't see how lowering the denominator because you don't think published studies estimate effect sizes well ought to make you more cheerful in response to the RPP.

    Replies
    1. Hi Katie,
      I agree completely with a slightly modified version of your statement, namely:
      If we assume (as we probably should) that the original studies replicated in the RPP have highly inflated effect sizes, then we should have never expected a high number of them to replicate in the first place **without increasing the sample size to provide adequate power to detect the effect.**

      In other words, the power analyses conducted by the authors of the RP:P were designed to assess how large the replication sample needed to be to have a high probability of detecting the effect observed in the initial study, if it were the true effect. Those power analyses assumed zero publication bias and zero effect size heterogeneity, which are definitely not reasonable assumptions to make. So we know the studies were underpowered to detect the effects of interest, but we don't know (and can't really know) by how much. Another way to say this is that if the studies had actually been highly powered (i.e., conducted with larger sample sizes), we have no idea how many of the original effects would have been statistically significant.

      This has zero implications for how cheerful or concerned we should be.

      All it means is that if you want a measure of how imperfect our field is, look elsewhere. The percentage of significant studies in the RP:P is just not a good measure of this particular thing. We have other measures, though, that are far more informative on this front and that converge to tell us there are problems (e.g., estimates of publication bias across the sciences). And there are ways of using the RP:P data to provide a better measure—see, for instance, the Bayesian re-analysis here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0149794. (Although note that you don't really need the RP:P replication data to arrive at the conclusion that the original studies don't provide us with as much information as we would like about the effects they were intended to study—we could have gotten that from just looking at the imprecision of the original studies themselves.)

      Where does this leave us? The field is in need of improvement! How much improvement? I find it difficult to care deeply about this question, although some do. I think it essential to acknowledge that there are problems, that we don’t know as much as we thought we did from the research in the published literature, and that we need to improve our methods and practices and conduct new research to accumulate more information. But I would far rather focus on taking those essential steps to improve our science, right away (see the next post on baby steps for more), than to sit around hotly debating whether we know 40% versus 70% of what we thought we did. Either way, I see huge room for improvement. So let’s improve.
