Monday, February 22, 2016

36 is the new 42

Once upon a time, according to Douglas Adams in The Hitchhiker’s Guide to the Galaxy, a group of hyperintelligent, pandimensional beings took it upon themselves to build an especially fancy supercomputer, which was charged with the task of deducing the Answer to Life, the Universe, and Everything. After seven and a half million years of intensive processing, the supercomputer announced the Answer to the breathlessly waiting crowd.

The Answer, it turned out, was...42.

Of course, the problem was that no one quite knew what the question was.

Recently, in our own universe, a commotion broke out amidst a group of highly intelligent beings over a different number. According to the Reproducibility Project, 36% of a set of attempted replication studies produced significant results.

36, it turned out, was the Answer.

It’s just not quite clear what the question was.

Think of it this way. Let’s say I told you that a new restaurant opened down the street, and it has a three-star rating. Or that I learned I scored a 17 on happiness.

Three stars out of what, you would say, like a good scientist. 17 out of how many? 18? 25? 100?

We have the same problem with the number 36. We know the top of the ratio, but not the bottom. The answer, but not the question.

The question, for example, is not “How many studies out of 100 will replicate?” because only 97 of the original 100 studies reported a significant effect. “Fine,” you might think, rolling your eyes ever so slightly, “it’s technically how many out of 97.” But wait, that can’t be right either, because it assumes that each replication study has 100% power. If we were to run 100 studies of true effects at 80% power, we would expect only about 80 to reach significance.

“Aha,” you say, sure that you’ve got me. To determine the bottom of the ratio, we just need to multiply 97 by the power of the replication studies. For instance, if the typical power of the replication studies was 80%, we’d expect about 78 exact replications to hit significance, whereas if the typical power was 40%, we’d expect fewer than 39 exact replications to hit significance.
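If it helps to see that arithmetic spelled out, here it is as a minimal sketch in Python; the power values are assumptions plugged in purely for illustration, not estimates of anything:

    # Expected number of significant results among 97 exact replications
    # of true effects, for a few assumed levels of typical power
    # (the power values are purely illustrative).
    n_original_significant = 97

    for typical_power in (1.00, 0.80, 0.40):
        expected_hits = n_original_significant * typical_power
        print(f"power = {typical_power:.2f} -> expect ~{expected_hits:.0f} significant replications")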

So: What was the typical power of the studies in the Reproducibility Project?

Well, based on the observed effect sizes in the original 97 studies, the authors of the Science article used classic power analyses to estimate that the average power of the replication studies was 92%, so that about 89 exact replications should have hit significance. But of course, we know that publication bias can lead classic power analyses to woefully underestimate the number of participants needed to achieve adequate power, especially for small-to-medium-sized effects and especially when the original study N isn’t very big (see Perugini et al., 2014, PPS, for an excellent discussion of this issue). For instance, to replicate an original study testing d = .5 with N = 60, Perugini and colleagues recommend an N that’s over three times as large as the N suggested by a classic power analysis (N = 324 instead of N = 102).
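To make the publication-bias worry concrete, here is a toy sketch (this is not Perugini et al.’s safeguard calculation, and every number in it is a made-up assumption): plan a replication around the effect size the original study reported, then ask how much power that plan really buys if the true effect is smaller.

    # Classic power analysis based on the observed (possibly inflated) effect
    # size, versus the power that plan actually has against a smaller true
    # effect. The effect sizes below are illustrative assumptions.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    d_observed = 0.5   # effect size reported by the original study
    d_true = 0.3       # hypothetical smaller true effect after publication bias

    # Classic plan: per-group sample size for 80% power at the observed d
    n_per_group = analysis.solve_power(effect_size=d_observed, power=0.80, alpha=0.05)
    print(f"planned n per group: {n_per_group:.0f}")

    # Power of that same plan if the true effect is d_true
    actual_power = analysis.power(effect_size=d_true, nobs1=n_per_group, alpha=0.05)
    print(f"power against the smaller true effect: {actual_power:.2f}")

With these made-up numbers, the “adequately powered” plan ends up with well under 50% power against the smaller true effect, which is the basic reason classic power analyses can mislead us here.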


Okay, so we know that 89 is likely to be far too large as an estimate for the bottom of our ratio. But the number shrinks even further when we consider the need to adjust for effect size heterogeneity, which also causes classic power analyses to yield overoptimistic estimates. As effect size heterogeneity increases, the required sample size to attain adequate power increases sharply; a study that you thought was powered at 80%, for instance, might be powered at only 60% when the effect size is small and heterogeneity is moderate (see McShane & Bockenholt, 2014 PPS).
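As a rough illustration of that heterogeneity point (again with made-up numbers, not anything estimated from the RP:P), here is a small simulation: a design that gives about 80% power against a fixed d = .5 gives less when the true effect varies from study to study around that same average.

    # Toy simulation of how effect size heterogeneity eats into power.
    # All numbers are illustrative assumptions.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_per_group, n_sims = 64, 5000   # n chosen for ~80% power at a fixed d = 0.5
    d_mean = 0.5                     # average true effect across studies

    def simulated_power(tau):
        """Share of simulated studies reaching p < .05 when each study's
        true effect is drawn from Normal(d_mean, tau)."""
        hits = 0
        for _ in range(n_sims):
            d = rng.normal(d_mean, tau) if tau > 0 else d_mean
            treatment = rng.normal(d, 1.0, n_per_group)
            control = rng.normal(0.0, 1.0, n_per_group)
            if ttest_ind(treatment, control).pvalue < 0.05:
                hits += 1
        return hits / n_sims

    print(f"power with a fixed effect:        {simulated_power(0.0):.2f}")
    print(f"power with heterogeneous effects: {simulated_power(0.3):.2f}")

With these assumptions the fixed-effect version comes out near 80% and the heterogeneous version comes out noticeably lower; the exact values are beside the point, the direction is what matters.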

How much further does the bottom of our ratio shrink when we account for publication bias and effect size heterogeneity? It depends on the true effect sizes tested in the original studies, the extent to which publication bias affected those particular studies, and the degree of effect size heterogeneity—all numbers that we do not know and that we are arguably ill equipped to estimate. Could it be just a little shrinkage? Sure. Maybe the bottom of the ratio drops to about 70. That means out of an expected 70 significant results, we saw just 36, providing us with a dishearteningly low 51% success rate.

But could it be a lot? Yes, especially if the original studies were relatively small, subject to strong publication bias, and characterized by heterogeneous effect sizes (all quite plausible characterizations in this case). Maybe it drops to about 40. That means out of an expected 40 significant results, we saw 36, providing us with a remarkably high 90% success rate.
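To put the whole problem in one place, here is the same 36 read against a few hypothetical denominators; every denominator below is an assumption, which is exactly the trouble:

    # The observed count of significant replications, divided by several
    # assumed denominators (expected numbers of significant results).
    observed_significant = 36

    for expected in (89, 70, 50, 40):
        rate = observed_significant / expected
        print(f"expected {expected} significant -> apparent success rate {rate:.0%}")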

Of course, now I’m just pulling numbers out of thin air. But this is precisely my point. If you find yourself balking at one or the other of these estimates—51% seems way too low, or 90% seems way too high—that’s based on your intuition, not on evidence. We don’t know—and in fact cannot know—what the bottom of that ratio is. We don’t know the question that we’re asking. There are many things we can learn from the Reproducibility Project. But the meaning of 36?

Not that.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Bonus nuance:
It is well worth noting that the 270 authors of the (excellent and very nuanced) paper reporting the Reproducibility Project emphasize the distinctly un-magical nature of the number 36, and take considerable care to unpack what we can and cannot learn from the various results reported in the article (H/T Daniel Lakens for highlighting this important point in a recent FB discussion). So, you know, a shorter version of this post might have been "Oops, we forgot the nuance! See original article." But I wanted an excuse to cite Douglas Adams.

5 comments:

  1. I found this a very frustrating read, because it doesn't explore the practical meaning of each of the "denominator values" you suggest.

    If you think that when you read a random article in a social psychology journal, you should be able to take its results as a pretty good statement of "what would happen if I tried this at home," then your denominator should be 100. This is the typical standard that we hold science to: what was demonstrated is something that, in general, I can reliably expect to happen given the same set of circumstances.

    If, on the other hand, you already take it as a given that psychology "cuts corners" and doesn't meet the typically rigorous standards of "capital S" science, you can build in more and more generous correction factors. Well, the official recommendation from Jacob Cohen is 80% power, which (combined with publication bias) means that we end up with inflated effect sizes and slightly lower power to detect the same effect again than we otherwise would. So I'm going to knock down my confidence in any given psychology result a bit.

    Wait, but what about between-study variability? What about the empirical fact that people often don't even get to 80% power? Let me handicap my expectations further.

    These handicaps don't change the fact that only 36% of the original results came out the same way when we repeated them. They are ways of re-calibrating expectations according to how low we're willing to set our standards as a field. They may be useful as retrospective explanations--"aha, there are very legitimate reasons why we got such a low number"--but they don't invalidate the importance of 36%. Do you want to read the literature of a field where you know that only 36% of results would come out the same way if you repeated the experiment? Would you want to teach those results to students?

    36% should be a wake-up call. It is most definitely not some random number generated by a computer constructed by pan-dimensional beings.

    Replies
    1. Thanks for sharing your thoughts, Alex! I think it would be useful to take a closer look at some of these assumptions.

      First, I think we need to ask whether we should really define “capital S” science as “what was demonstrated is something that, in general, I can reliably expect to happen given the same set of circumstances.” Is a scientific finding one that happens 100% of the time?

      If a medical study found that taking a new drug reduces the length of a cold by 50% on average, compared to a placebo group, does that mean that when I take the new drug my cold will definitely be gone in half the time? Well, no, because 50% shorter was the average effect across the participants in the study sample. The size (and even direction) of the effect may have differed across the study participants, I may be different from the study participants, and my cold may be different from the cold experienced by the participants in this study.

      What about a study in which researchers inject genetically identical mice with Substance A versus Substance B, and find that the mice in the Substance A condition show a higher number of T cells in their footpads? Well, if there is any unreliability (i.e., inconsistency) at all in the manipulation and measure…if sometimes the injection leads to a tiny bit more or less Substance A, say, depending on how the mouse’s head is tilted when you administer the injection…then again, the results of this experiment will fluctuate from one iteration to the next.

      Taking a more mathematical approach, we can ask what we should expect to find when we replicate a scientific experiment in the presence of sampling error (the first example above) and/or measurement error (the second example above). This open access article by Stanley & Spence provides a great discussion of these issues.
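      (If it helps to see the sampling-error piece in action, here is a tiny simulation sketch; the effect size and sample size are made up, and it only shows how much exact replications of a perfectly real effect bounce around.)

        # Exact replications of a real effect still scatter widely from
        # sampling error alone. All numbers are illustrative assumptions.
        import numpy as np
        from scipy.stats import ttest_ind

        rng = np.random.default_rng(1)
        d_true, n_per_group, n_reps = 0.4, 50, 1000

        observed_diffs, significant = [], 0
        for _ in range(n_reps):
            treatment = rng.normal(d_true, 1.0, n_per_group)
            control = rng.normal(0.0, 1.0, n_per_group)
            observed_diffs.append(treatment.mean() - control.mean())
            if ttest_ind(treatment, control).pvalue < 0.05:
                significant += 1

        print(f"observed mean differences range from {min(observed_diffs):.2f} to {max(observed_diffs):.2f}")
        print(f"significant in {significant / n_reps:.0%} of exact replications")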

      I think the question you ask about what is the ideal replicability rate for a given literature is also very interesting. I might start with the slightly simpler question of what is the ideal false positive rate for a given scientific domain? Do we want it to be zero—all statistically significant effects are definitely true? That certainly sounds like a good idea on the surface. But now let’s consider a potential downside to this approach…as we get more and more conservative about our false positive rate, we also decrease our ability to discover (or the efficiency with which we can discover) true effects. Most often, scientists talk about this as a Type I/Type II error tradeoff, but we can also think of this idea in terms of scientists having a finite pool of resources (participants, money, time, personnel, etc.) and having to decide how to apportion those resources (see e.g., Lakens, 2014, EJSP). If we wanted to prioritize the rate of true scientific discoveries, then, or the efficiency with which we can discover true things about the world, we actually wouldn’t want a false positive rate of zero. What rate would we want? That’s an excellent question that deserves some careful consideration.
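      (And here is a cartoon of that resource-allocation point, with entirely made-up numbers: with a fixed pool of participants, making the false positive rate stricter means larger studies, fewer studies, and therefore fewer true discoveries.)

        # With a fixed participant pool, a stricter alpha means bigger studies,
        # so fewer studies, so fewer expected true discoveries.
        # All numbers are illustrative assumptions.
        import math
        from statsmodels.stats.power import TTestIndPower

        analysis = TTestIndPower()
        participant_pool = 10000   # total participants available
        d_true = 0.5               # assumed effect size when an effect is real
        p_effect_is_real = 0.5     # assumed share of tested hypotheses that are true

        for alpha in (0.05, 0.005, 0.0005):
            n_per_group = math.ceil(analysis.solve_power(effect_size=d_true, power=0.80, alpha=alpha))
            n_studies = participant_pool // (2 * n_per_group)
            true_discoveries = n_studies * p_effect_is_real * 0.80   # 80% power by design
            false_positives = n_studies * (1 - p_effect_is_real) * alpha
            print(f"alpha={alpha}: {n_studies} studies, "
                  f"~{true_discoveries:.0f} true discoveries, ~{false_positives:.1f} false positives")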

    2. Stanley & Spence, 2014: http://pps.sagepub.com/content/9/3/305.abstract

  2. Alison, doesn't what you are saying amount to: if we assume (as we probably should) that the original studies replicated in the RPP have highly inflated effect sizes, then we should have never expected a high number of them to replicate in the first place? This is fair as far as it goes, but isn't that precisely what people who read the RPP pessimistically are reacting to with consternation? I suppose this is just another version of Alex's comment. I am maybe a little more neutral than he is about what the "right" rate of false positives in the literature is, but I don't see how lowering the denominator because you don't think published studies estimate effect sizes well ought to make you more cheerful in response to the RPP.

    Replies
    1. Hi Katie,
      I agree completely with a slightly modified version of your statement, namely:
      If we assume (as we probably should) that the original studies replicated in the RPP have highly inflated effect sizes, then we should have never expected a high number of them to replicate in the first place **without increasing the sample size to provide adequate power to detect the effect.**

      In other words, the power analyses conducted by the authors of the RP:P were designed to assess how large the replication sample needed to be to have a high probability of detecting the effect observed in the initial study, if it were the true effect. Those power analyses assumed zero publication bias and zero effect size heterogeneity, which are definitely not reasonable assumptions to make. So we know the studies were underpowered to detect the effects of interest, but we don't know (and can't really know) by how much. Another way to say this is that if the studies had actually been highly powered (i.e., conducted with larger sample sizes), we have no idea how many of the original effects would have been statistically significant.

      This has zero implications for how cheerful or concerned we should be.

      All it means is that if you want a measure of how imperfect our field is, look elsewhere. The percentage of significant studies in the RP:P is just not a good measure of this particular thing. We have other measures, though, that are far more informative on this front and that converge to tell us there are problems (e.g., estimates of publication bias across the sciences). And there are ways of using the RP:P data to provide a better measure—see, for instance, the Bayesian re-analysis here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0149794. (Although note that you don't really need the RP:P replication data to arrive at the conclusion that the original studies don't provide us with as much information as we would like about the effects they were intended to study—we could have gotten that from just looking at the imprecision of the original studies themselves.)

      Where does this leave us? The field is in need of improvement! How much improvement? I find it difficult to care deeply about this question, although some do. I think it essential to acknowledge that there are problems, that we don’t know as much as we thought we did from the research in the published literature, and that we need to improve our methods and practices and conduct new research to accumulate more information. But I would far rather focus on taking those essential steps to improve our science, right away (see the next post on baby steps for more), than to sit around hotly debating whether we know 40% versus 70% of what we thought we did. Either way, I see huge room for improvement. So let’s improve.
