Sunday, March 27, 2016

Baby Steps

If you hadn’t noticed by now, I can be indecently peppy. Over at PsychMAP, when my (brilliant and thoughtful) co-moderator gloomily surveys the imperfections of our field and threatens to drown his sorrows in drink, I find myself invariably springing in like Tigger with a positive reframing. We’ve found flaws in our practices? What a chance to improve! An effect didn’t replicate? Science is sciencing! Eeyore’s tail goes missing? Think of all the opportunities!

Every now and then, though, I stare at this field that I love without recognizing it, and I want to sit down in the middle of the floor and cry. Or swear. Or quit and open a very small but somehow solvent restaurant that serves delicious food with fancy wine pairings to a grand total of six people per night.

The thing that gets me to land mid-floor with a heavy and heart-sinking whomp is not discovering imperfections about our past or noting the lurching, baby-giraffe-like gait of our field’s uneven but promising steps toward progress—it’s when the yelling gets so loud, so polarized, so hands-over-my-ears-I-can’t-hear-you-la-la-la that it drowns out everything else. It’s when Chicken Little and the Ostrich have a screaming match. It’s when people stop being able to listen.

I’ve written before about some of the unintended and damaging consequences that this kind of tone can have, and here’s another: Add these loud debates to the shifting standards and policies in our field right now, and the average researcher’s instinct might be, quite reasonably, to freeze. If it’s unclear which way things are going and each side thinks the other is completely and hopelessly wrong about everything, then maybe the best course of action is to keep your head down, continue with business as usual, and wait to see how things shake out. What’s the point of paying attention yet if nobody can agree on anything? Why start trying to change if the optimal endpoint is still up for debate?

If you find yourself thinking something along these lines from time to time, I am here to say (passionately and melodramatically from where I sit contemplating my future prospects as a miniature restaurateur, plopped in the middle of the floor): Don’t freeze.

Here’s the thing: We don’t have to agree on exactly where we’re going, and we certainly don’t have to agree on the exact percentage of imperfection in our field, to agree that our current practices aren’t perfect. Of course they’re not perfect—nothing is perfect. We can argue about exactly how imperfect they are, and how to measure imperfection, and those discussions can sometimes be very productive. But they’re not a necessary precondition for positive change.

In fact, all that’s needed for positive change is for each of us to acknowledge that our research practices aren’t perfect, and to identify a step—even a very small, incremental, baby step—that would make them a little better. And then to take the step. Even if it’s the tiniest baby step imaginable in the history of the world. One step. And then to look around for another one.

So for example, a few years ago, my lab took a baby step. We had a lab meeting, which was typical. We talked about a recent set of articles on research practices, which was also typical. But this time, we asked ourselves a new question: In light of what we know now about issues like power, p-hacking, meta-analysis, and new or newly rediscovered tools like sequential analyses, what are some strategies that we could adopt right now to help distinguish between findings that we trust a lot and findings that are more tentative and exploratory?

We made a list. We talked about distinguishing between exploratory and confirmatory analyses. We talked about power and what to do when a power analysis wasn’t possible. We generated some arbitrary heuristics about target sample sizes. We talked about how arbitrary they were. We scratched some things off the list. We added a few more.

We titled our list “Lab Guidelines for Best Practices,” although in retrospect, the right word would have been “Better” rather than “Best.” We put a date at the top. We figured it would evolve (and it did). We decided we didn’t care if it was perfect. It was a step in the right direction. We decided that, starting that day, we would follow these guidelines for all new projects.


We created a new form that we called an Experiment Archive Form to guide us. (It evolved, and continues to evolve, along with our guidelines for better practices. Latest version available here.) 

And starting with these initial steps, our research got better. We now get to trust our findings more—to learn more from the work that we do. We go on fewer wild goose chases. We discover cool new moderators. We know when we can count on an effect and when we should be more tentative.

But is there still room for improvement? Surely there is always room for improvement.

So we look around.

You look around, too.

What’s one feasible, positive next step?


---
Some places to look:

        Braver, Thoemmes, & Rosenthal (2014 PPS): Conducting small-scale, cumulative meta-analyses to get a continually updating sense of your results.
        Gelman & Loken (2014 American Scientist): Clear, concise discussion of the “garden of forking paths” and the importance of acknowledging when analyses are data-dependent.
        Judd, Westfall, & Kenny (2012 JPSP): Treating stimuli as a random factor, boosting power in studies that use samples of stimuli.
        Lakens & Evers (2014 PPS): Practical recommendations to increase the informational value of studies.
        Ledgerwood, Soderberg, & Sparks (in press chapter): Strategies for calculating and boosting power, distinguishing between exploratory and confirmatory analyses, pros and cons of online samples, when and why to conduct direct, systematic, and conceptual replications.
        Maner (2014 PPS): Positive steps you can take as a reviewer or editor.
        PsychMAP: A Facebook group for constructive, open-minded, and nuanced conversations about Psychological Methods and Practices.

Monday, February 22, 2016

36 is the new 42

Once upon a time, according to Douglas Adams in The Hitchhiker’s Guide to the Galaxy, a group of hyperintelligent, pandimensional beings took it upon themselves to build an especially fancy supercomputer, which was charged with the task of deducing the Answer to Life, the Universe, and Everything. After seven and a half million years of intensive processing, the supercomputer announced the Answer to the breathlessly waiting crowd.

The Answer, it turned out, was...42.

Of course, the problem was that no one quite knew what the question was.

Recently, in our own universe, a commotion broke out amidst a group of highly intelligent beings over a different number. According to the Reproducibility Project, 36% of a set of attempted replication studies produced significant results.

36, it turned out, was the Answer.

It’s just not quite clear what the question was.

Think of it this way. Let’s say I told you that a new restaurant opened down the street, and it has a three-star rating. Or that I learned I scored a 17 on happiness.

Three stars out of what, you would say, like a good scientist. 17 out of how many? Out of 18? 25? 100?

We have the same problem with the number 36. We know the top of the ratio, but not the bottom. The answer, but not the question.

The question, for example, is not “How many studies out of 100 will replicate,” because only 97 of the original 100 studies reported a significant effect. “Fine,” you might think, rolling your eyes ever so slightly, “it’s technically how many out of 97.” But wait, that can’t be right either, because it assumes the replicating studies have 100% power. Even if all 100 effects were real, running 100 replication studies at 80% power would lead us to expect only about 80 to reach significance.

“Aha,” you say, sure that you’ve got me. To determine the bottom of the ratio, we just need to multiply 97 by the power of the replication studies. For instance, if the typical power of the replication studies was 80%, we’d expect about 78 exact replications to hit significance, whereas if the typical power was 40%, we’d expect fewer than 39 exact replications to hit significance.
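
To make the arithmetic concrete, here is a minimal sketch of the logic so far (in Python; the power values are illustrative assumptions, and it treats all 97 original effects as real and the replications as exact):

```python
# Sketch of the ratio logic: if all 97 original effects are real and the
# replications are exact, the expected number of significant replications
# is simply 97 times the replications' average power. Power values here
# are illustrative assumptions, not estimates.
SIGNIFICANT_ORIGINALS = 97   # original studies reporting a significant effect
OBSERVED_SUCCESSES = 36      # replications that reached significance

for avg_power in (1.00, 0.80, 0.40):
    expected = SIGNIFICANT_ORIGINALS * avg_power
    print(f"average power {avg_power:.0%}: expect ~{expected:.0f} significant, "
          f"implied success rate {OBSERVED_SUCCESSES / expected:.0%}")
```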

So: What was the typical power of the studies in the Reproducibility Project?

Well, based on the observed effect sizes in the original 97 studies, the authors of the Science article use classic power analyses to estimate that their average power was 92%, so that 89 exact replications should have hit significance. But of course, we know that publication bias can lead classic power analyses to woefully underestimate the number of participants needed to achieve adequate power, especially for small-to-medium sized effects and especially when the original study N isn’t very big (see Perugini et al., 2014 PPS, for an excellent discussion of this issue). For instance, to replicate an original study testing d = .5 with N = 60, Perugini and colleagues recommend an N that’s over three times as large as the N suggested by a classic power analysis (N = 324 instead of N = 102).  
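
To see the shape of the problem, here is a minimal sketch of a classic power calculation using statsmodels, recomputed under a smaller assumed true effect. The d values are assumptions for illustration, and this is not Perugini et al.’s actual safeguard calculation (which, as I understand it, substitutes a lower confidence bound for the observed effect); it just shows why an inflated published effect leads you to plan far too small a study:

```python
# Sketch: the required N balloons if the true effect is smaller than the
# published (possibly inflated) effect. Illustrative d values only; this is
# NOT Perugini et al.'s exact safeguard-power procedure.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in [("published (possibly inflated) d", 0.50),
                 ("smaller plausible true d", 0.30)]:
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"{label} = {d:.2f}: ~{2 * n_per_group:.0f} participants total "
          f"for 80% power (two-tailed, two independent groups)")
```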


Okay, so we know that 89 is likely to be far too large as an estimate for the bottom of our ratio. But the number shrinks even further when we consider the need to adjust for effect size heterogeneity, which also causes classic power analyses to yield overoptimistic estimates. As effect size heterogeneity increases, the required sample size to attain adequate power increases sharply; a study that you thought was powered at 80%, for instance, might be powered at only 60% when the effect size is small and heterogeneity is moderate (see McShane & Bockenholt, 2014 PPS).
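
The heterogeneity point lends itself to a quick simulation. The sketch below (all numbers assumed for illustration, not estimates from the Reproducibility Project) plans a two-group study for 80% power at d = 0.3 and then lets each simulated study’s true effect vary around 0.3; the share of significant results drops noticeably below the nominal 80%:

```python
# Sketch: effect size heterogeneity erodes nominal power. All numbers are
# illustrative assumptions, not estimates from the Reproducibility Project.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(0)
n_per_group = int(np.ceil(
    TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.80)))

def simulated_power(mean_d, tau, n_sims=5000):
    """Share of significant results when each study's true d ~ Normal(mean_d, tau)."""
    hits = 0
    for _ in range(n_sims):
        d = rng.normal(mean_d, tau)                    # this study's true effect
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(d, 1.0, n_per_group)
        hits += ttest_ind(treatment, control).pvalue < 0.05
    return hits / n_sims

print("planned n per group:", n_per_group)
print("no heterogeneity (tau = 0.0):      ", simulated_power(0.3, 0.0))
print("moderate heterogeneity (tau = 0.2):", simulated_power(0.3, 0.2))
```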

How much further does the bottom of our ratio shrink when we account for publication bias and effect size heterogeneity? It depends on the true effect sizes tested in the original studies, the extent to which publication bias affected those particular studies, and the degree of effect size heterogeneity—all numbers that we do not know and that we are arguably ill equipped to estimate. Could it be just a little shrinkage? Sure. Maybe the bottom of the ratio drops to about 70. That means out of an expected 70 significant results, we saw just 36, providing us with a dishearteningly low 51% success rate.

But could it be a lot? Yes, especially if the original studies were relatively small, subject to strong publication bias, and characterized by heterogeneous effect sizes (all quite plausible characterizations in this case). Maybe it drops to about 40. That means out of an expected 40 significant results, we saw 36, providing us with a remarkably high 90% success rate.
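
Put starkly: the numerator is fixed, and the “success rate” swings with whatever denominator you assume. The denominators below are just the scenarios sketched above:

```python
# The numerator is known; the denominator is not. The candidate denominators
# below are just the illustrative scenarios discussed in the text.
observed_successes = 36
for expected_significant in (89, 70, 40):
    rate = observed_successes / expected_significant
    print(f"if ~{expected_significant} should have hit significance: "
          f"success rate = {rate:.0%}")
```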

Of course, now I’m just pulling numbers out of thin air. But this is precisely my point. If you find yourself balking at one or the other of these estimates—51% seems way too low, or 90% seems way too high—that’s based on your intuition, not on evidence. We don’t know—and in fact cannot know—what the bottom of that ratio is. We don’t know the question that we’re asking. There are many things we can learn from the Reproducibility Project. But the meaning of 36?

Not that.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Bonus nuance:
It is well worth noting that the 270 authors of the (excellent and very nuanced) paper reporting the Reproducibility Project emphasize the distinctly un-magical nature of the number 36, and take considerable care to unpack what we can and cannot learn from the various results reported in the article (H/T Daniel Lakens for highlighting this important point in a recent FB discussion). So, you know, a shorter version of this post might have been "Oops, we forgot the nuance! See original article." But I wanted an excuse to cite Douglas Adams.

Thursday, February 18, 2016

Chicken Little and the Ostrich

There is, it turns out, a whole literature out there on persuasion and information processing. On what makes people think carefully and systematically, and what pushes them to seize and freeze on arbitrary heuristics. On what makes us close-minded and stubborn versus open-minded and willing to consider alternative viewpoints.

Stress, for instance, increases close-mindedness—in high-pressure situations, individuals tend to gravitate toward simple decision rules and groups tend to prioritize agreement over accuracy. Extreme positions tend to produce less acceptance and more counter-arguing than moderate positions, and threatening messages lead to defensive rather than open-minded information processing.*


If you had to distill this literature into one pithy take-home message, it might be this: If you go around shouting that the sky is falling, people will often stick their heads in the sand.**
 
I think it’s this cyclical dynamic of shouting and head-burying that so often frustrates me about the unfolding conversation about best practices in psychological science and beyond. When I look around at where we are and where we’ve been, I often see a lot of smart, motivated, well-intentioned people eager for positive change. Most (maybe all?) of us can agree that our science isn’t perfect, that there’s room for improvement. And most of us share the same overarching goal: We want to make our science better. We want to maximize the information we get from the work that we do.

Sometimes, the air is full of thoughtful, nuanced conversations about the numerous possible strategies for working toward that shared goal. Other times, all I can hear is people shouting “PANIC!” and proposing a list of new arbitrary decision rules to replace the old ones (like the dichotomy-happy reification of p < .05) that arguably played a major role in producing many of our field’s current problems in the first place.

There is such potential here, in this moment of our evolving science, to move beyond cut-off values and oversimplification and arbitrary decision rules. There is change in the air. The ship has set sail. We’re moving.

What we need now is nuance. We need thoughtful conversations about the best courses to chart, the most promising routes, how to navigate around unexpected boulders. We need open-minded discussions that involve not just talking but also listening.

So let’s leave the sand behind, please, and let’s also quit telling people to panic. There’s a vast horizon out there. We’re in motion. Let’s stop shouting and start steering.


* See e.g., Kruglanski et al., 2006; Ledgerwood, Callahan, & Chaiken, 2014; Liberman & Chaiken, 1992; Sherif & Hovland, 1961; Sherman & Cohen, 2002

** H/T Eli Finkel for the bird-based metaphors.