Incurably Nuanced: February 2016

Once upon a time, according to Douglas Adams in The Hitchhiker’s Guide to the Galaxy, a group of hyperintelligent, pandimensional beings took it upon themselves to build an especially fancy supercomputer, which was charged with the task of deducing the Answer to Life, the Universe, and Everything. After seven and a half million years of intensive processing, the supercomputer announced the Answer to the breathlessly waiting crowd.

The Answer, it turned out, was...42.

Of course, the problem was that no one quite knew what the question was.

Recently, in our own universe, a commotion broke out amidst a group of highly intelligent beings over a different number. According to the Reproducibility Project, 36% of a set of attempted replication studies produced significant results.

36, it turned out, was the Answer.

It’s just not quite clear what the question was.

Think of it this way. Let’s say I told you that a new restaurant opened down the street, and it has a three-star rating. Or that I learned I scored a 17 on happiness.

Three stars out of what, you would say, like a good scientist. 17 out of how many? Out of 18? 25? 100?

We have the same problem with the number 36. We know the top of the ratio, but not the bottom. The answer, but not the question.

The question, for example, is not “How many studies out of 100 will replicate,” because only 97 of the original 100 studies reported a significant effect. “Fine,” you might think, rolling your eyes ever so slightly, “it’s technically how many out of 97.” But wait, that can’t be right either, because it assumes the replicating study has 100% power. If we were to run 100 studies at 80% power, we would expect about 80 to reach significance.

“Aha,” you say, sure that you’ve got me. To determine the bottom of the ratio, we just need to multiply 97 by the power of the replication studies. For instance, if the typical power of the replication studies was 80%, we’d expect about 78 exact replications to hit significance, whereas if the typical power was 40%, we’d expect fewer than 39 exact replications to hit significance.
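The arithmetic in that last step can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and it assumes (as the paragraph above does) that every original effect is real and that every replication has the same power.

```python
# Number of original studies that reported a significant effect (from the post).
ORIGINAL_SIGNIFICANT = 97


def expected_significant(n_studies: int, power: float) -> float:
    """Expected number of significant replications.

    Hypothetical helper: assumes every original effect is real and every
    replication study has the same statistical power.
    """
    if not 0.0 <= power <= 1.0:
        raise ValueError("power must be between 0 and 1")
    return n_studies * power


if __name__ == "__main__":
    for power in (1.0, 0.8, 0.4):
        count = expected_significant(ORIGINAL_SIGNIFICANT, power)
        print(f"power = {power:.0%}: about {count:.1f} significant replications expected")
```

At 80% power this comes to 77.6 (the "about 78" above), and at 40% power to 38.8 (the "fewer than 39").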

So: What was the typical power of the studies in the Reproducibility Project?

Well, based on the observed effect sizes in the original 97 studies, the authors of the Science article used classic power analyses to estimate that their average power was 92%, so that 89 exact replications should have hit significance. But of course, we know that publication bias can lead classic power analyses to woefully underestimate the number of participants needed to achieve adequate power, especially for small-to-medium-sized effects and especially when the original study N isn’t very big (see Perugini et al., 2014 PPS, for an excellent discussion of this issue). For instance, to replicate an original study testing d = .5 with N = 60, Perugini and colleagues recommend an N that’s over three times as large as the N suggested by a classic power analysis (N = 324 instead of N = 102).

Okay, so we know that 89 is likely to be far too large as an estimate for the bottom of our ratio. But the number shrinks even further when we consider the need to adjust for effect size heterogeneity, which also causes classic power analyses to yield overoptimistic estimates. As effect size heterogeneity increases, the required sample size to attain adequate power increases sharply; a study that you thought was powered at 80%, for instance, might be powered at only 60% when the effect size is small and heterogeneity is moderate (see McShane & Bockenholt, 2014 PPS).

How much further does the bottom of our ratio shrink when we account for publication bias and effect size heterogeneity? It depends on the true effect sizes tested in the original studies, the extent to which publication bias affected those particular studies, and the degree of effect size heterogeneity—all numbers that we do not know and that we are arguably ill equipped to estimate. Could it be just a little shrinkage? Sure. Maybe the bottom of the ratio drops to about 70. That means out of an expected 70 significant results, we saw just 36, providing us with a dishearteningly low 51% success rate.

But could it be a lot? Yes, especially if the original studies were relatively small, subject to strong publication bias, and characterized by heterogeneous effect sizes (all quite plausible characterizations in this case). Maybe it drops to about 40. That means out of an expected 40 significant results, we saw 36, providing us with a remarkably high 90% success rate.
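To see how much the headline figure swings with the bottom of the ratio, here is a small Python sketch. The function name is invented for illustration, and the denominators are simply the candidate values floated in this post (97, 89, 70, 40), not estimates of anything.

```python
# The post's numerator: replications that reached significance.
OBSERVED_SIGNIFICANT = 36


def success_rate(observed: int, expected: float) -> float:
    """Observed significant results as a fraction of the expected number.

    Hypothetical helper: 'expected' is whatever bottom-of-the-ratio
    value you are willing to assume.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed / expected


if __name__ == "__main__":
    # Candidate denominators taken from the post; none is a real estimate.
    for expected in (97, 89, 70, 40):
        rate = success_rate(OBSERVED_SIGNIFICANT, expected)
        print(f"36 out of an expected {expected}: {rate:.0%}")
```

The same numerator yields anything from roughly 37% to 90%, depending entirely on the denominator you pick.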

Of course, now I’m just pulling numbers out of thin air. But this is precisely my point. If you find yourself balking at one or the other of these estimates—51% seems way too low, or 90% seems way too high—that’s based on your intuition, not on evidence. We don’t know—and in fact cannot know—what the bottom of that ratio is. We don’t know the question that we’re asking. There are many things we can learn from the Reproducibility Project. But the meaning of 36?

Not that.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Bonus nuance:
It is well worth noting that the 270 authors of the (excellent and very nuanced) paper reporting the Reproducibility Project emphasize the distinctly un-magical nature of the number 36, and take considerable care to unpack what we can and cannot learn from the various results reported in the article (H/T Daniel Lakens for highlighting this important point in a recent FB discussion). So, you know, a shorter version of this post might have been "Oops, we forgot the nuance! See original article." But I wanted an excuse to cite Douglas Adams.

There is, it turns out, a whole literature out there on persuasion and information processing. On what makes people think carefully and systematically, and what pushes them to seize and freeze on arbitrary heuristics. On what makes us close-minded and stubborn versus open-minded and willing to consider alternative viewpoints.

Stress, for instance, increases close-mindedness—in high pressure situations, individuals tend to gravitate toward simple decision rules and groups tend to prioritize agreement over accuracy. Extreme positions tend to produce less acceptance and more counter-arguing than moderate positions, and threatening messages lead to defensive rather than open-minded information processing.*

If you had to distill this literature into one pithy take-home message, it might be this: If you go around shouting that the sky is falling, people will often stick their heads in the sand.**

I think it’s this cyclical dynamic of shouting and head-burying that so often frustrates me about the unfolding conversation about best practices in psychological science and beyond. When I look around at where we are and where we’ve been, I often see a lot of smart, motivated, well-intentioned people eager for positive change. Most (maybe all?) of us can agree that our science isn’t perfect, that there’s room for improvement. And most of us share the same overarching goal: We want to make our science better. We want to maximize the information we get from the work that we do.

Sometimes, the air is full of thoughtful, nuanced conversations about the numerous possible strategies for working toward that shared goal. Other times, all I can hear is people shouting “PANIC!” and proposing a list of new arbitrary decision rules to replace the old ones (like the dichotomy-happy reification of p < .05) that arguably played a major role in producing many of our field’s current problems in the first place.

There is such potential here, in this moment of our evolving science, to move beyond cut-off values and oversimplification and arbitrary decision rules. There is change in the air. The ship has set sail. We’re moving.

What we need now is nuance. We need thoughtful conversations about the best courses to chart, the most promising routes, how to navigate around unexpected boulders. We need open-minded discussions that involve not just talking but also listening.

So let’s leave the sand behind, please, and let’s also quit telling people to panic. There’s a vast horizon out there. We’re in motion. Let’s stop shouting and start steering.

* See, e.g., Kruglanski et al., 2006; Ledgerwood, Callahan, & Chaiken, 2014; Liberman & Chaiken, 1992; Sherif & Hovland, 1961; Sherman & Cohen, 2002.

** H/T Eli Finkel for the bird-based metaphors.

Monday, February 22, 2016

36 is the new 42

Thursday, February 18, 2016

Chicken Little and the Ostrich