Monday, October 22, 2018

WHY You Preregister Should Inform the WAY You Preregister


Sometime in the 2012-2014 range, as the reproducibility crisis was heating up in my particular corner of science, I fell in love with a new (for me) approach to research. I fell in love with the research question.

Before this time, most of my research had involved making directional predictions that were at least loosely based on an existing theoretical perspective, and the study was designed to confirm the prediction. I didn’t give much thought to whether, if the study did not come out as expected, it would help to falsify the prediction and associated theory. (The answer was usually no: When the study didn’t come out as expected, I questioned the quality of the study rather than the prediction. In contrast, when the study “worked,” I took it as evidence in support of the prediction. Kids at home: Don’t do this: It’s called motivated reasoning, and it’s very human but not particularly objective or useful for advancing science.)

But at some point, I began to realize that science, for me, was much more fun when I asked a question that would be interesting regardless of how the results turned out, rather than making a single directional prediction that would only be interesting if the results confirmed it. So my lab started asking questions. We would think through logical reasons that one might expect Pattern A vs. Pattern B vs. Pattern C, and then we would design a study to see which pattern occurred.* And, whenever we were familiar enough with the research paradigm and context to anticipate the likely researcher degrees of freedom we would encounter when analyzing our data, we would carefully specify a pre-analysis plan so that we could have high confidence in results that we knew were based on data-independent analyses.

One day, as my student was writing up some of these studies for the first time, we came across a puzzle. Should we describe the studies as “exploratory,” because we hadn’t made a clear directional prediction ahead of time? Or as “confirmatory,” because our analyses were all data-independent (that is, we had exactly followed a carefully specified pre-analysis plan when analyzing the results)?

This small puzzle became a bigger puzzle as I read more work, new and old, about prediction, falsification, preregistration, HARKing, and the distinction between exploratory and confirmatory research. It became increasingly clear to me that it is often useful to distinguish between the goal of theory falsification and the goal of Type I error control, and to be clear about exactly what kinds of tools can help achieve each of those goals.

I wrote a short piece about this for PNAS. Here’s what it boils down to:

1. If you have the goal of testing a theory, it can be very useful to preregister directional predictions derived from that theory. In order to say that your study tested a theory, you must be willing to upgrade or downgrade your confidence in the theory in response to the results. If you would trust the evidence if it supported your theory but question it if it contradicted your theory, then it’s not a real test. Put differently, it’s not fair to say that a study provides support for your theory if you wouldn’t have been willing to say (if the results were different) that the same study contradicted your theory.**

2. If you have the goal of avoiding unintentional Type I error inflation so that you can place higher confidence in the results of your study, it can be very useful to preregister an analysis plan in which you carefully specify the various researcher decisions (or “researcher dfs”) that you will encounter as you construct your dataset and analyze your results. If your analyses are data-independent and if you account for multiple testing, you can take your p-values as diagnostic of the strength of evidence for your effect.*** (A small simulation sketch follows this list.)
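To make the Type I error point concrete, here is a minimal simulation sketch (mine, not from the PNAS letter): a researcher studying a true null effect tries two candidate outcome measures and three outlier cutoffs, then reports whichever analysis gives the smallest p-value. The specific decision points simulated here are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group = 5000, 50
false_pos_fixed = false_pos_flexible = 0

for _ in range(n_sims):
    pvals = []
    for outcome in range(2):                  # two candidate outcome measures
        a = rng.normal(0, 1, n_per_group)     # control group, true effect = 0
        b = rng.normal(0, 1, n_per_group)     # treatment group, true effect = 0
        for cutoff in (np.inf, 2.5, 2.0):     # three plausible exclusion rules
            p = stats.ttest_ind(a[np.abs(a) < cutoff],
                                b[np.abs(b) < cutoff]).pvalue
            pvals.append(p)
    false_pos_fixed += pvals[0] < .05         # the single pre-specified analysis
    false_pos_flexible += min(pvals) < .05    # cherry-pick the best of six

print(f"Type I error, pre-specified analysis: {false_pos_fixed / n_sims:.3f}")
print(f"Type I error, flexible analysis:      {false_pos_flexible / n_sims:.3f}")
```

The first number hovers around .05; the second is reliably higher, even though every individual analysis looks defensible on its own. A preregistered analysis plan controls the error rate by locking in one of those paths before the data get a vote.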

Why do I think this distinction is so important? Because thinking clearly about WHY you are preregistering (a) helps ensure that the way you preregister actually helps achieve your goals and (b) answers a lot of questions that may arise along the way.


Here’s an example of (a): If you want to test a theory, you need to be ready to update your confidence in the theory in EITHER direction, depending on the results of the study. If you can’t specify ahead of time (even at the conceptual level) a pattern of evidence that would reduce your confidence in the theory, the study is not providing a true test...and you don’t get to say that evidence consistent with the prediction provides support for the theory. (For instance: If a study is designed to test the theoretical prediction that expressing prejudice will increase self-esteem, you must be willing to accept the results as evidence against the theory if you find that expressing prejudice decreases self-esteem.)

Here’s an example of (b): You might preregister before running a study and then find unexpected results. If they aren’t very interesting to you, do you need to find a way to publicize them? The answer depends on why you preregistered in the first place. If you had the goal of combatting publication bias and/or theory testing, the answer is definitely YES. But if your goal was solely to constrain your Type I error rate, you’re done—deciding not to publish obviously won’t increase your risk of reporting a false positive.


Read the (very short) PNAS letter here. You can also read a reply from Nosek et al. here, and two longer discussions (Part I and Part II) that I had with the authors about where we agree and disagree, which I will unpack more in subsequent posts.


---

*The best part of this approach? We were no longer motivated to find a particular pattern of results, and could approach the design and analysis of each study in a much more open-minded way. We could take our ego out of it. We got to yell "THAT'S FASCINATING!" when we analyzed our data, regardless of the results.

**In philosophy of science terms, tests of theoretical predictions must involve some risk. The risk in question derives from the fact that the theoretical prediction must be “incompatible with certain possible results of observation” (Popper, 1962, p. 36)—it must be possible for the test to result in a finding that is inconsistent with the theoretical prediction.

***de Groot (2014) has a great discussion of this point.

Monday, October 15, 2018

You say potato, I say: I've been living in a downpour of hailing potatoes and I'm beyond exhausted, so could you please help me hold this umbrella?

Let’s talk about sexism and science. Not the science of sexism, but sexism in science, and what does “sexism” really mean, and what happens when well-intentioned people who all prioritize diversity and inclusion find themselves on opposite sides of an uncomfortable conversation.

Context is important for understanding the impact of any single event, so I’m going to start there, with one context for each side.

Context #1: Once upon a time, a program committee for a conference with arguably an “old-boys network” reputation wanted to improve the diversity of its conference speakers, and so put a ton of effort and time into creating a conference program that was remarkably diverse in many ways and represented a variety of perspectives. For example, in the entire conference program, there was only a single manel, which, if you’ve been following gender imbalances in who gets asked to share their expertise in the academy and beyond, is impressively low. Meanwhile, there were four symposia with all female speakers, which is impressively high.


Context #2: Once upon a time, world history was what it is and sexism (and racism and homophobia and all the rest) were longstanding, entrenched, systemic problems that members of groups disadvantaged by the existing social hierarchy had to confront every day in a dozen small ways and sometimes also very big ways. The Kavanaugh hearings had brought to the fore just how little women’s voices matter in our society. And many women were feeling exhausted…exhausted by thinking almost constantly about the Kavanaugh hearing, exhausted by reliving past experiences of rape and sexual assault and men not taking “no” for an answer. Exhausted by every single tiny and large barb that we navigate on a daily basis, not just in our personal lives but also as professors. The student who asks for our advice on doing better in another class with a male professor because they “don’t want to bother him, he’s such a real professor you know and I wouldn’t want to waste his time,” the committee meeting where male professors introduce themselves to each other but not to us because they assume we’re the secretary there to take notes, the news articles that forget to interview a single woman scientist, the wildly disproportionate service expectations.

In these two contexts, the manel at the conference (the only manel at the conference, and simultaneously one of many manels in a manel-filled world) became the subject of a conversation. And I walked into the conversation and thought to myself, ugh, I am so tired of manels in the world, and I said something like: “those speakers are all smart, lovely, excellent scholars. but blech. i wish dudes would start refusing to be on panels that had such uneven representation. (although i can imagine you sometimes wouldn’t know until you accepted the invite...and yet how cool would it be if white dudes started politely bowing out of all-male panels? ‘i’m so sorry, i just realized that this will be an all-male panel! you don’t need me on here too...i’d suggest asking Dr. X, a super duper expert on this topic, instead!’)”

Now I’m going to move away from the complexities of this specific conversation and just talk about my own experience of it and what I learned. My experience of this conversation was that I intended to express a sentiment about the world (“Sexism in the world sucks! I’d love to see men taking opportunities to push back against it, when they arise!”), and some of my friends who are men heard a sentiment about individuals (“You are sexist!”).

I think that this experience helps highlight a theme that might be useful to all of us when we have conversations about these kinds of issues. I'm going to borrow from Ijeoma Oluo's kickass book to make this point, pulling out a distinction she makes about race and applying it to gender. (Her book is awesome and worth reading, if you haven’t read it yet. Click here to fix that).

Here's the distinction: We can think about sexism at an individual level (e.g., when a senior male professor asks if I can write up my notes on a committee meeting and send them to everyone, is that sexist or something he'd ask anyone? Did a particular panel end up being all men because of individual sexism or did the organizers try hard to find women speakers?). But we can also think about sexism at the systemic level (e.g., is there a broad pattern of men disproportionately asking women to take on invisible and secretarial kinds of service work? Is there a broad pattern of women scholars being relatively overlooked when it comes to invited symposia, speakers, journal articles, etc.?). I suspect that often, when a woman points out an example of something that feels sexist to her, she is trying to point out SYSTEMIC sexism. For example, I was running on a crowded narrow bridge the other day, and a man bicycled through very quickly, and a woman commented: “Of course the men speed through without stopping.” She didn’t mean this (I suspect) as any kind of comment about that particular man or his intentions in that moment. She was pointing out the impact of a broader pattern. She was venting frustration about how time after time after time after time she steps politely to the side while various men in various contexts barrel through.

I’m going to go a step further and speculate that often, when a woman points out an example of something that feels sexist to her, men may hear her comment as being focused on INDIVIDUAL sexism. For example, if the bicyclist overheard the comment about men speeding through, he might have thought to himself “For goodness sakes! That is so unfair! I just didn’t SEE the people on the bridge in time because I was trying to steer around the dog, and I am like the most feminist guy in the world and WTF is up with this random woman accusing me of being sexist!?! I’m one of the good guys!”

Similarly, when I made my comment above expressing frustration with all-male panels and wishing that more white guys would start refusing to be on them, I was expressing a sentiment about the systemic version of this issue and one way to address it. I actually wasn’t thinking much about this specific panel, because one specific panel isn’t a problem—only the broader pattern is a problem—and because the particular people on this particular panel were all awesome social justice warriors in various ways and so it really wouldn’t occur to me in a million years to call them sexist at an individual level.

If I’m right—if women often see and try to call out systemic sexism, but men often hear (and therefore quite reasonably feel defensive about) claims of individual sexism—then the question becomes how to have productive and respectful conversations about sexism without being derailed by miscommunication.

One answer is that it is up to both sides to work harder to clarify what they mean: Women need to make the extra effort to clarify when they are talking about a broad pattern and not any one particular individual or instance, and to recognize the individual efforts that the individual sitting across from them in the conversation may have put into fighting discrimination and improving inclusion already. And men need to make the extra effort to notice when they are feeling defensive, let those feelings go, recognize that the conversation might be more productive if they can hear it as being about a broad pattern rather than any particular individual behavior, and reflect on how they might do more to help combat the broad pattern in the future.

But I would like to be provocative here and suggest a different answer. I would like to challenge the men reading this to make more than 50% of the effort in conversations like these. I would like to challenge you to make 60% or 80% or even 100% of the effort when necessary. Why? Because you have privilege in this context. And that privilege comes with secret stashes of extra bandwidth. Bandwidth that you didn’t spend nodding and smiling at the student who implied you weren’t a real professor. Bandwidth that you didn’t spend noticing that another newspaper article didn’t interview a single scientist of your gender. Bandwidth that you didn’t spend reserving a room or finding an article when a senior male colleague explained that he just didn’t understand how to deal with the online system and could you please just do it for him. Bandwidth that you didn’t spend editing that email for the fourth time to make sure it wouldn’t come across as too combative. Bandwidth that you didn’t spend pouring hours into conversation after conversation where you try to very patiently explain your lived experience to men who don’t understand or fully try to understand it in various ways. Bandwidth that you could, if you so choose, spend listening when women take the time and energy to point out systemic sexism, and responding without defensiveness, with a simple “I am listening.” Bandwidth that you could use to amplify women’s voices with your own voice that is louder and further reaching in this context.

Meanwhile, white women reading this: I have a challenge for us, as well. We have privilege in so many other contexts. And that privilege comes with secret stashes of extra bandwidth. And the next time we’re sitting across from someone, virtually or in person, who says “here is my lived experience as a person of color”—let’s just sit there and LISTEN. Even if we want to say “but I’m one of the good ones!” Even if we are tempted to explain something about our own perspective in response. Instead, let’s remember that screamy feeling in our chest, the exhaustion in our heads and hearts these days. Let’s tap into the secret bandwidth stash we bring to conversations about race, and let’s respond with “I hear you. I am listening.” And let’s amplify other people’s voices with our own that are louder and further reaching in this context.

Monday, March 5, 2018

Preregistration is a Hot Mess. And You Should Probably Do It.

There’s this episode of Family Guy where some salesman offers Peter a choice between two prizes: a boat or a mystery box. Peter can’t help himself—he reaches for the mystery box. “A boat's a boat, but a mystery box could be ANYTHING!” he says. “It could even be a boat!”

Mystery boxes are like that. Full of mystery and promise. We imagine they could be anything. We imagine them as their best selves. We imagine that they will Make Life Better, or in science, that they will Make Science Better.

Lately, I’ve been thinking a lot about what’s inside the mystery box of preregistration.

Why do I think of preregistration as a mystery box? I’ll talk about the first reason now, and come back to the second in a later post. Reason the First is that we don’t know what preregistration is or why we’re doing it. Or rather, we disagree as a field about what the term means and what the goals are, and yet we don’t even know there’s disagreement. Which means that when we talk about preregistration, we’re often talking past each other.

Here are three examples that I gave in my talk last week at SPSP.

Back in the olden days, in June 2013, Chris Chambers and colleagues published an open letter in the Guardian entitled “Trust in Science Would be Improved by Preregistration.” They wrote: “By basing editorial decisions on the question and method, rather than the results, pre-registration overcomes the publication bias that blocks negative findings from the literature.” In other words, they defined preregistration as peer review and acceptance of articles before the results are known, and they saw the major goal or benefit of preregistration as combatting publication bias.

Fast forward to December 2017: Leif Nelson and colleagues publish a blog post on “How to Properly Preregister a Study.” They define preregistration as “time-stamped documents in which researchers specify exactly how they plan to collect their data and to conduct their key confirmatory analyses,” and they explain that the goal is “to make it easy to distinguish between planned, confirmatory analyses…and unplanned exploratory analyses.” 

Then just a few weeks ago, out of curiosity, I posted the following question on PsychMAP: A researcher tweets: "A new preregistered study shows that X affects Y." What would you typically infer about this study? By far the most popular answer (currently at 114 votes, compared to the next most popular of 44) was that the main hypothesis was probably a priori (no HARKing). In other words, many psychologists seem to think preregistration means writing down your main hypothesis ahead of time, and perhaps that a primary goal is to avoid HARKing.

So if I tell you that I preregistered my study, what do I mean? And why did I do it—what was my goal?

I think we are in desperate need of a shared and precise language to talk about preregistration. As with any other scientific construct, we’re not going to make much headway on understanding it or leveraging it if we don’t have a consensus definition of what we are talking about.

It seems to me that there are (at least) four types of preregistration, and that each one serves its own potential function or goal. These types of preregistration are not mutually exclusive, but doing any one of them doesn’t necessarily mean you’re doing the others.



Notice that I said POTENTIAL benefit or function. That’s crucial. It means that depending on how you actually implement your preregistration, you may or may not achieve the ostensible goal of that type of preregistration. Given how important these goals are for scientific progress, we need to be paying attention to how and when preregistration can be an effective tool for achieving them.

Let’s say you want to combat publication bias, so you preregister your study on AsPredicted or OSF. Who is going to be looking for your study in the future? Will they be able to find it, or might the keywords you’re using be different from the ones they would search for? Will they be able to easily find and interpret your results? Will you link to your data? If so, will a typical researcher be able to understand your data file at a glance, or did you forget to change the labels from, say, the less-than-transparent VAR1043 and 3REGS24_2?

Let’s say you have a theory that’s ready to be tested.* So you record your hypothesis ahead of time: “Stereotyping will increase with age.” But what kind of stereotypes? Measured how? The vagueness of the prediction leaves too much wiggle room for HARKing later on—and here I mean HARKing in the sense of retroactively fitting the theory to whatever results you happen to find. If you find that racial stereotypes get stronger but elderly stereotypes get weaker as people age, your vague a priori prediction leaves room for your motivated human mind to think to itself “well of course, the prediction doesn’t really make sense for stereotypes of the elderly, since the participants are themselves elderly.” Obviously, adjusting your theory in light of the data is fine if you’re theory building (e.g., asking “Do various stereotypes increase with age, and if so, when?”), but you wouldn’t want to do it and then claim that you were testing a theory-derived hypothesis in a way that allowed for the theory to be falsified. [Update: As Sanjay Srivastava pointed out in a PsychMAP discussion about this post, it's important to recognize that often, researchers wishing to provide a strong and compelling test of a theory will want to conduct a preregistration that combines elements 2 and 3—that is, specific, directional predictions about particular variables that are clearly derived from theory PLUS clear a priori constraints on researcher degrees of freedom.]



Or let’s say that you want to be able to take your p-value as an indicator of the strength of evidence for your effect, in a de Grootian sense, and so you preregister a pre-analysis plan. If you write down “participants have to be paying attention,” it doesn’t clearly constrain flexibility in data analysis because there are multiple ways to decide whether participants were paying attention (e.g., passing a particular attention check vs. how long participants spent completing the survey vs. whether a participant clicked the same number for every item on a survey). If you want to avoid unintentional researcher degrees of freedom (or “HARKing” in the sense of trying out different researcher decisions and selecting only the ones that produce the desired results), you need to clearly and completely specify all possible researcher decisions in a data-independent way.**
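For instance, here is a hypothetical sketch of what “participants have to be paying attention” might look like once it is pinned down in a pre-analysis plan. The column names, cutoffs, and ordering are invented for illustration; the point is only that every decision is specified before seeing the data.

```python
import pandas as pd

def apply_preregistered_exclusions(df: pd.DataFrame) -> pd.DataFrame:
    """Return the analysis sample, applying the preregistered rules in order."""
    # Rule 1: drop anyone who fails the embedded attention-check item.
    df = df[df["attention_check_passed"] == 1]
    # Rule 2: drop completion times under 120 seconds or more than 3 SDs above
    #         the mean, with the mean and SD computed on the post-Rule-1 sample.
    upper = df["duration_seconds"].mean() + 3 * df["duration_seconds"].std()
    df = df[(df["duration_seconds"] >= 120) & (df["duration_seconds"] <= upper)]
    # Rule 3: drop straight-liners (zero variance across the ten scale items).
    item_cols = [f"item_{i}" for i in range(1, 11)]
    df = df[df[item_cols].std(axis=1) > 0]
    return df
```

These particular rules aren't the "right" ones; what matters is that none of them can quietly shift after you've peeked at the results.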

In fact, the registered report is really the only kind of preregistration on here that’s straightforward to implement in an effective way, because much of the structure of how to do it well has been worked out by the journal and because the pieces of the preregistration are being peer reviewed.

Which brings me to the second reason why I call preregistration a mystery box: What’s inside a preregistration when it HASN’T been part of a peer reviewed registered report? Maybe not what you would assume. Stay tuned.


---

*Many of us don’t do a lot of theory testing, by the way—we might be collecting data to help build theory, or asking a question that’s not very theoretical but might later end up connecting speculatively to some theories in potentially interesting ways (the sort of thing you do in a General Discussion)—but we’re not working with a theory that generates specific, testable predictions yet.

**Yeah, so, we’ve been using “HARKing” to mean two things…sometimes we use it to mean changing the theory to fit the results, which hampers falsifiability, and sometimes we use it to mean reporting a result as if it’s the only one that was tested, which hampers our ability to distinguish between exploratory and confirmatory analyses. (In his 1998 article, Kerr actually talked about multiple variants of HARKing and multiple problems with it.)***

***We’ve also been using “exploratory/confirmatory” to refer both to exploratory vs. confirmatory PREDICTIONS (do you have a research question or a directional prediction?) and to exploratory vs. confirmatory ANALYSES (are your analyses data-dependent or data-independent, i.e., selected before seeing the data?).****

****Did I mention that our terminology is a hot mess?

Tuesday, January 16, 2018

You're Running out of Time

Maybe you’ve been meaning to get to it. Maybe you keep thinking, I just don’t have time right now, but soon, as soon as I submit this paper, as soon as I finish teaching this class. Maybe you’re waiting for it to blow over. Maybe it feels like a choice between work and rubbernecking to watch some kind of field-wide car crash, and you’ve been choosing to work. 

Or maybe it seems like a social psychology problem, and you’re not in that area, so it doesn’t even apply to you. In any case, business as usual. Onward and upward, publish or perish, keep on moving, nothing to see here. 

Here’s the problem, though.

You’re running out of time.

This pesky “crisis” thing? It isn’t going away.* It isn’t limited to one area of psychology, or even just psychology. It’s not something you can ignore and let other people deal with. And it isn’t even something you can put off grappling with in your own work for just another month, semester, year, two years. The alarm bells have been sounded—alarm bells about replicability, power, and publication bias—and although these concerns have been raised before and repeatedly, a plurality of scholars across scientific disciplines are finally listening and responding in a serious way.

Now, it takes time to change your research practices. You have to go out and learn about the problems and proposed solutions, you have to identify which solutions make sense for your own particular research context, you have to learn new skills and create new lab policies and procedures. You have to think carefully about things like power (no, running a post-hoc power analysis to calculate observed power is not a good idea) and preregistration (like why do you want to preregister and which type of preregistration will help you accomplish your goals?), and you probably have to engage in some trial and error before you figure out the most effective approaches for your lab.

So a few years ago, when someone griped to me about seeing a researcher present a conference talk with no error bars in the graphs, I nodded sympathetically but also expressed my sense that castigating the researcher in question was premature. Things take a while to percolate through the system. Not everybody hears about this stuff right away. It might take people a while to go back through every talk and find every dataset and add error bars. Let’s have some patience. Let’s wait for things to percolate. Let’s give people a chance to learn, and try new things, and improve their research practices going forward, and let’s give that research time to make its slow way through the publication process and emerge into our journals.


Now, though? It’s 2018. And you’re submitting a manuscript where you interpret p = .12 on one page as “a similar trend emerged” that is consistent with your hypothesis, and on another page you use another p = .12 to conclude that “there were no differences across subsamples, so we do not investigate this variable further”…or you’re writing up a study where you draw strong conclusions from the lack of a significant difference on a behavioral outcome between 5 year olds and 7 year olds, with a grand total of 31 children per group and no discussion of the limited reliability of your measure?



Or you’re giving a talk…a new talk, about new data…and you haven’t put error bars on your graphs? And for your between-subjects interaction…for a pretty subtle effect…you collected TWENTY people per cell? And you don’t talk about power at all when you’re describing the study? Or the next study? Or the next?
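For readers who want the arithmetic, here is a quick back-of-the-envelope check, a sketch using statsmodels with illustrative effect sizes (d = 0.5 and d = 0.3) that are assumptions for the sake of the calculation, not estimates from any particular study.

```python
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()

# Interpreting a null difference with 31 children per group, assuming d = 0.5:
print(power.power(effect_size=0.5, nobs1=31, alpha=0.05))          # about .5

# Twenty people per cell, treated generously as a simple two-group test of a
# subtle effect (d = 0.3):
print(power.power(effect_size=0.3, nobs1=20, alpha=0.05))          # about .15

# Per-group sample size needed for 80% power to detect d = 0.3:
print(power.solve_power(effect_size=0.3, power=0.80, alpha=0.05))  # about 175
```

Even under the generous assumption of a medium effect, that null result with 31 per group is roughly a coin flip, and the subtle-interaction scenario is worse.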

Well now you’ve lost me. I’m looking out the window. I’m wondering why I’m here. Or actually, I’m wondering why YOU’RE here. Why are you here?

Are you here to science?

Well then. It’s time to pay attention.

Here is one good place to start.**



*Note, I'm not here to debate how bad the replicability crisis is. Lots of other people seem to find value in doing that, but I'm personally more interested in starting with a premise we can all agree on -- i.e., that there's always room for improvement -- and making progress on those improvements.

**And let me just emphasize that word start. I'm not saying you're out of time to finish making all improvements to your research methods and practices -- in fact, I see improving methods and practices as a process that we can incorporate into our ongoing research life, not something that really gets finished. Again, nothing is ever perfect...we can always be looking for the next step. But I do think it's time for EVERYONE to be looking for, and implementing, whatever that next step is in their own particular research context. If you find that you're still on the sidelines -- get in the game. This is not something to watch and it's not something to ignore. It's something you need to be actively engaged in.


Tuesday, November 14, 2017

Walking and Talking

I’m going to say something, and you’re not going to like it: It’s a hell of a lot easier, these days, to talk the talk than to walk the walk.

I mean this in at least three different ways.

1. Low-Cost Signals vs. High-Cost Actions
It is far easier to extol publicly the importance of changing research practices than to actually incorporate better practices in your own research. I can put together a nice rant in about ten minutes about how everyone should preregister and run highly powered studies and replicate before publishing…and while you’re at it, publish slower, prioritize quality over quantity, don't cherry-pick, and post all your data!

But actually thinking through how to increase the informational value of my own research, learning the set of skills necessary to do so, and practicing what I preach? Well that’s far more time-consuming, effortful, and costly.

For example. Let’s say you have a paper. It’s totally publishable, probably at a high impact journal. It reports some findings that for whatever reason, you’re not super confident about. Do you publish it? All together now: “No way!” (See? Talk is easy.)

But do you ACTUALLY decide against publishing it? Because if you do (and I have, repeatedly), your publication count and citation indices take a hit. And your coauthors’ counts and indices take a hit. And now your bean count is lower than it might otherwise be in a system that still prioritizes beans at many levels.



“Down with the beans!” you say. “Let’s change the incentive structure of science!” Awesome. Totally agree. And saying this is easy. Do you actually do it? Do you go into the faculty meeting and present a case for hiring or promoting someone WITHOUT counting any beans? And do you do this REGARDLESS of whether or not the bean count looks impressive? Because it’s tempting to only bother with the longer, harder quality conversation if the quantity isn’t there. And, if you do focus the conversation exclusively on quality, someone is likely to ask you to count the beans anyway. In fact, even if you are armed with an extensive knowledge of the quality of the candidate’s papers and a compelling case for why quality matters, you are going to have an uphill battle to convince the audience to prioritize quality over quantity—especially if those audience members come from areas of psychology that have not yet had to grapple seriously with issues of replicability and publication bias.

Or maybe you say “yes, publish that paper with the tentative findings—just be transparent about your lack of confidence in the results! At the end of the day, publish it all…just be sure to distinguish clearly between exploratory (data-dependent) and confirmatory (data-independent) findings!” Totally agree. And again: Talk is easy. When you submit your paper, do you clearly state the exploratory nature of your findings front and center, so that even casual readers are sure to see it? If it’s not in the abstract, most people are likely to assume the results you describe are more conclusive than they actually are. But if you put it front and center, you may dramatically lower the chances that your paper gets accepted. (I haven’t tried this one yet, for exactly this reason…instead, I’ve been running pre-registered replications before trying to publish exploratory results. But again, that’s far easier to advocate than to actually do, especially when a study requires substantial resources.)

2. Superficial Shortcuts vs. Deep Thinking
It’s far easier to say you’ve met some heuristic rules about minimum sample sizes, sharing your data, or preregistering than it is to learn about and carefully think through each of these practices, what their intended consequences are, and how to actually implement them in a way that will achieve their intended consequences.


For example. I can upload my data file to the world wide internets in about 60 seconds. Whee, open data! But how open is it, really? Can other researchers easily figure out what each variable is, how the data were processed, and how the analyses were run? Clearly labeling your data and syntax, providing codebooks, making sure someone searching for data like yours will be able to find it easily--all of these things take (sometimes considerable) time and care…but without them, your “open data” is only open in theory, not practice.
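Here is a tiny, hypothetical sketch (the file and variable names are invented) of the unglamorous work that turns a posted file into data someone outside your lab can actually use: descriptive column names plus a codebook saved alongside the data.

```python
import pandas as pd

raw = pd.DataFrame({"VAR1043": [3.2, 2.8], "3REGS24_2": [1, 2]})  # stand-in for a raw export

clean = raw.rename(columns={
    "VAR1043": "self_esteem_post",   # mean of the post-manipulation self-esteem items
    "3REGS24_2": "condition",        # 1 = control, 2 = treatment
})
clean.to_csv("study1_clean.csv", index=False)

codebook = pd.DataFrame({
    "variable": ["self_esteem_post", "condition"],
    "description": [
        "Mean of 10 self-esteem items (1-4 scale), measured after the manipulation",
        "Experimental condition: 1 = control, 2 = treatment",
    ],
})
codebook.to_csv("study1_codebook.csv", index=False)
```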



Likewise, I can quickly patch together some text that loosely describes the predictions for a study I’m running and post it on OSF and call it a preregistration. But I’m unlikely to get the intended benefits of preregistration unless I understand that there are multiple kinds of preregistration, that each kind serves a different goal, and what it takes to achieve the goal in question. Likewise, I can read a tweet or Facebook post about a preregistered paper and decide to trust it (“those keywords look good, thumbs up!”), or I can go read everything critically and carefully. Equating preregistered studies with good science is easy, and we’ve done this kind of thing before (p < .05! Must be true!). Going beyond the heuristic to think critically about what was in the preregistration and what that means for how much confidence we should have in the results…that’s much harder to do.

3. Countering a Norm in Words vs. Deeds
Now, you might be thinking: It is NOT easy to talk the talk when you’re in the [career stage, institution, or environment] that I am in! And that may be very true. But of course, even here, the talk is still easier than the walk. Talking to your adviser about the merits of clearly distinguishing between data-dependent and data-independent analyses in a paper may be a challenge…but actually convincing them to agree to DO it is probably harder. Publicly stating that you’ve preregistered something may have costs if the people around you think preregistration is silly. But asking yourself why you’re preregistering—what you hope to gain (theory falsification? Type I error control?) and how to actually implement the preregistration in a way that gets you those benefits—that’s an extra layer of effort.


So what’s the point of all this talking that I’m doing right now? It’s to acknowledge that change is hard, and costly. Anyone who tells you this is easy—that there are no tradeoffs, that we just need some new simple decision rules to replace the old ones—is oversimplifying. They are talking but not walking, in the second sense above.

But this post full of talking is also meant to be a challenge to all the talkers out there, and to myself as well. Are you really (am I really) doing what you’re advocating to the fullest extent that you can? The answer is probably no. 
So: What’s the next step?

Monday, June 26, 2017

Who to Invite for Your Next Methods and Practices Symposium

Planning a symposium or panel on methods and practices in psychology? Here's a collection of top-notch speakers to consider inviting.* Inspired by a recent post on PsychMAP as well as #womenalsoknowstuff—not to mention the frequency with which people ask me to recommend female speakers because they can't think of any—these are all women. So now there is no excuse for the 100% male panel on the subject. In fact, you could easily have a 100% female panel of stellar experts (and it's been done! exactly once, as far as I know). Keep in mind that many of these scholars could also be excellent contributors to special issues and handbooks on methods and practices topics.

Here are names and institutions for potential speakers across a range of career stages. These scholars can all speak to issues that relate to our field's unfolding conversations and debates about replicability and improving research methods and practices. When possible, I've linked the name to a relevant publication as well so that you can get a sense of some of their work.


(And of course, this list is incomplete. If you or someone you know should be on it, please leave a comment with the scholar's name, position, institution, relevant speaking topics, and a link to a relevant paper if applicable!)


Samantha Anderson, PhD student, University of Notre Dame

Statistical power, replication methodology, more nuanced ways to determine the "success" of a replication study

Jojanneke Bastiaansen, Postdoc, Groningen

Citation distortion, bias in reporting 

Christina Bergmann, Max Planck Institute Nijmegen, The Netherlands

Crowd-sourced meta-analyses, open science, improving research practices in infancy research 

Dorothy Bishop, Professor, Oxford
Reproducibility, open science

Erin Buchanan, Associate Professor, Missouri State University 
Effect sizes and confidence intervals, alternatives to NHST, Bayesian statistics, statistical reporting

Katherine Button, Lecturer, University of Bath

Power estimation, replicability

Krista Byers-Heinlein, Associate Professor, Concordia University

Organizing large multi-lab collaborative studies and RRRs (she leads the ManyBabies Bilingual project, an RRR at AMPPS currently in data collection), working with hard-to-recruit/hard-to-test/hard-to-define populations (bilingual infants), and making sure the media gets your science right.

Katie Corker, Assistant Professor, Grand Valley State University
Meta-analysis, replication, perspectives on open science from teaching institutions

Angelique Cramer, Associate Professor, Tilburg University

Slow science, open science, exploratory vs. confirmatory hypothesis testing, hidden multiple-testing issues in ANOVA, replication issues in the context of psychopathology research

Alejandrina Cristia, Researcher, Ecole Normale Supérieure
Crowd-sourced meta-analyses, research practices in infancy research

Pamela Davis-Kean, Professor, University of Michigan

Large developmental data sets, replication

Elizabeth Dunn, Professor, University of British Columbia
Pre-registration, how researchers think about Bayes Factors, the NHST debate

Arianne Eason, PhD student, University of Washington

Research practices in infancy research

Ellen Evers, Assistant Professor, University of California, Berkeley

Statistical power, reliability of published work

Fernanda Ferreira, Professor, UC Davis
Open science, open access, replication, how to design appropriate replication studies when original studies involve stimuli that may be specific to certain time periods or contexts (e.g., words used in an experiment in psycholinguistics)

Jessica Flake, Postdoc, York University

Construct validation, measurement, instrument design

Susann Fiedler, Research Group Leader, Max Planck Institute for Research on Collective Goods, Bonn, Germany

Economics and ethics of science, reproducibility, publication bias, incentive structures, digital scholarship and open science

Shira Gabriel, Associate Professor, SUNY Buffalo
Editor perspective on changes in the field and implementing new ideas in journals

Kiley Hamlin, Associate Professor, University of British Columbia

How to improve methods when you study hard-to-recruit populations; personal experiences with the dangers of failing to document everything and how to prevent this problem in your own lab.

Erin Hennes, Assistant Professor, Purdue University

Simulation methods for power analysis in complex designs

Ase Innes-Ker, Senior Lecturer, Lund University
Open science, replication, peer review

Deborah Kashy, Professor, Michigan State University

Reporting practices, transparency

Melissa Kline, Postdoc, MIT

Improving practices in infancy research

Alison Ledgerwood, Associate Professor, UC Davis

Practical best practices; how to design a study to maximize what you learn from it (strategies for maximizing power, distinguishing exploratory and confirmatory research); how to learn more from exploratory analyses; promoting careful thinking across the research cycle.

Carole Lee, Associate Professor, University of Washington
Philosophy of science, peer review practices, publication guidelines

Dora Matzke, Assistant Professor, University of Amsterdam

Bayesian inference

Michelle Meyer, Assistant Professor and Associate Director, Center for Translational Bioethics and Health Care Policy at Geisinger Health System

Topics related to responsible conduct of research, research ethics, or IRBs, including ethical/policy/regulatory aspects of replication, data preservation/destruction, data sharing and secondary research uses of existing data, deidentification and reidentification, and related IRB and consent issues.

Kate Mills, Postdoc, University of Oregon 

Human neuroscience open data, multi-site collaboration

Lis Nielsen, Chief, Individual Behavioral Processes Branch, Division of Behavioral and Social Research, NIH
Improving reproducibility, validity, and impact

Michèle Nuijten, PhD student, Tilburg University

Replication, publication bias, statistical errors, questionable research practices

Elizabeth Page-Gould, Associate Professor, University of Toronto

Reproducibility in meta-analysis

Jolynn Pek, Assistant Professor, York University
Quantifying uncertainties in statistical results of popular statistical models and bridging the gap between methodological developments and their application.

Cynthia Pickett, Associate Professor, UC Davis

Changing incentive structures, alternative approaches to assessing merit.

Julia Rohrer, Fellow, Deutsches Institut für Wirtschaftsforschung, Berlin

Metascience, early career perspective on replicability issues

Caren Rotello, Professor, UMass Amherst

Measurement issues, response bias, why replicable effects may nevertheless be erroneous.

Victoria Savalei, Associate Professor, University of British Columbia

The NHST debate, how people reason about and use statistics and how this relates to the replicability crisis, how researchers use Bayes Factors.

Anne Scheel, PhD student, Ludwig-Maximilians-Universität, Munich

Open science, pre-registration, replication issues from a cognitive and developmental psychology perspective, early career perspective

Linda Skitka, Professor, University of Illinois at Chicago

Empirically assessing the status of the field with respect to research practices and evidentiary value; understanding perceived barriers to implementing best practices.

Courtney Soderberg, Statistical and Methodological Consultant, Center for Open Science

Pre-registration and pre-analysis plans, sequential analysis, meta-analysis, methodological and statistical tools for improving research practices.

Jessica Sommerville, Professor, University of Washington

Research practices in infancy research.

Jehan Sparks, PhD student, UC Davis

Practical strategies for improving research practices in one's own lab (e.g., carefully distinguishing between confirmatory and exploratory analyses in a pre-analysis plan).

Barbara Spellman, Professor, University of Virginia

Big-picture perspective on where the field has been and where it’s going; what editors can do to improve the field; how to think creatively about new ideas and make them happen (e.g., RRRs at Perspectives on Psychological Science)

Sara Steegen, PhD student, University of Leuven, Belgium

Research transparency, multiverse analysis

Victoria Stodden, Associate Professor, University of Illinois at Urbana-Champaign
Enabling reproducibility in computational science, developing standards of openness for data and code sharing, big data, privacy issues, resolving legal and policy barriers to disseminating reproducible research.

Jennifer Tackett, Associate Professor, Northwestern

Replicability issues in clinical psychology and allied fields

Sho Tsuji, Postdoc, UPenn and LSCP, Paris

Crowd-sourced meta-analysis

Anna van 't Veer, Postdoc, Leiden University

Pre-registration, replication

Simine Vazire, Associate Professor, UC Davis; Co-founder, Society for the Improvement of Psychological Science (SIPS)

Replication, open science, transparency

Anna de Vries, PhD student, Groningen

Citation distortion, bias in reporting, meta-analysis

Tessa West, Associate Professor, NYU
Customized power analysis, improving inclusion in scientific discourse

Edit (6/27/17): Note that this list doesn't even try to cover the many excellent female scholars who could speak on quantitative methods more broadly—I will leave that to someone else to compile (and if you take this on, let me know and I'll link to it here!). In this list, I'm focusing on scholars who have written and/or spoken about issues like statistical power, replication, publication bias, open science, data sharing, and other topics related to core elements of the field's current conversations and debates about replicability and improving research practices (i.e., the kinds of topics covered on this syllabus). 


Thursday, June 15, 2017

Guest Post: Adjusting for Publication Bias in Meta-Analysis - A Response to Data Colada [61]

A recent blogpost on Data Colada raises the thorny but important issue of adjusting for publication bias in meta-analysis. In this guest post, three statisticians weigh in with their perspective.

Data Colada Post [61]: Why p-curve excludes ps > .05
Response of Blakeley B. McShane, Ulf Böckenholt, and Karsten T. Hansen

The quick version:
Below, we offer a six-point response to the recent blogpost by Simonsohn, Simmons, Nelson (SSN) on adjusting for publication bias in meta-analysis (or click here for a PDF with figures). We disagree with many of the points raised in the blogpost for reasons discussed in our recent paper on this topic [MBH2016]. Consequently, our response focuses on clarifying and expounding upon points discussed in our paper and provides a more nuanced perspective on selection methods such as the three-parameter selection model (3PSM) and the p-curve (a one-parameter selection model (1PSM)).

We emphasize that all statistical models make assumptions, that many of these are likely to be wrong in practice, and that some of these may strongly impact the results. This is especially the case for selection methods and other meta-analytic adjustment techniques. Given this, it is a good idea to examine how results vary depending on the assumptions made (i.e., sensitivity analysis) and we encourage researchers to do precisely this by exploring a variety of approaches. We also note that it is generally good practice to use models that perform relatively well when their assumptions are violated. The 3PSM performs reasonably well in some respects when its assumptions are violated while the p-curve does not perform so well. Nonetheless, we do not view the 3PSM or any other model as a panacea capable of providing a definitive adjustment for publication bias and so we reiterate our view that selection methods—and indeed any adjustment techniques—should at best be used only for sensitivity analysis.


The full version:
Note: In the below, “statistically significant” means “statistically significant and directionally consistent” as in the Simonsohn, Simmons, Nelson (SSN) blogpost. In addition, the “p-curve” refers to the methodology discussed in SNS2014 that yields a meta-analytic effect size estimate that attempts to adjust for publication bias.(1)

Point 1: It is impossible to definitively adjust for publication bias in meta-analysis 
As stated in MBH2016, we do not view the three-parameter selection model (3PSM) or any other model as a panacea capable of providing a definitive adjustment for publication bias. Indeed, all meta-analytic adjustment techniques—whether selection methods such as the 3PSM and the p-curve or other tools such as trim-and-fill and PET-PEESE—make optimistic and rather rigid assumptions; further, the adjusted estimates are highly contingent on these assumptions. Thus, these techniques should at best be used only for sensitivity analysis.
[For more details in MBH2016, see the last sentence of the abstract; last paragraph of the introduction; point 7 in Table 1; and most especially the entire Discussion.]

Point 2: Methods discussions must be grounded in the underlying statistical model
All statistical models make assumptions. Many of these are likely to be wrong in practice and some of these may strongly impact the results. This is especially the case for selection methods and other meta-analytic adjustment techniques. Therefore, grounding methods discussions in the underlying statistical model is incredibly important for clarity of both thought and communication.
SSN argue against the 3PSM assumption that, for example, a p=0.051 and p=0.190 study are equally likely to be published; we agree this is probably false in practice. The question, then, is what is the impact of this assumption and can it be relaxed? Answer: it is easily relaxed, especially with a large number of studies. We believe the p-curve assumptions that (i) effect sizes are homogeneous, (ii) non-statistically significant studies are entirely uninformative (and are thus discarded), and (iii) a p=0.049 study and a p=0.001 study are equally likely to be published are also doubtful. Further, we know via Jensen’s Inequality that the homogeneity assumption can have substantial ramifications when it is false—as it is in practically all psychology research.
[For more details in MBH2016, see the Selection Methods and Modeling Considerations sections for grounding a discussion in a statistical model and the Simulation Evaluation section for the performance of the p-curve.]

Point 3: Model evaluation should focus on estimation (ideally across a variety of settings and metrics)
SSN’s simulation focuses solely on Type I error—a rather uninteresting quantity given that the null hypothesis of zero effect for all people in all times and in all places is generally implausible in psychology research (occasional exceptions like ESP notwithstanding). Indeed, we generally expect effects to be small and variable across people, times, and places. Thus, “p < 0.05 means true” dichotomous reasoning is overly simplistic and contributes to current difficulties in replication. Instead, we endorse a more holistic assessment of model performance—one that proceeds across a variety of settings and metrics and that focuses on estimation of effect sizes and the uncertainty in them. Such an evaluation reveals that the 3PSM actually performs quite well in some respects—even in SSN’s Cases 2-5 and variants thereof in which it is grossly misspecified (i.e., when its assumptions are violated; see Point 6 below).
[For more details in MBH2016, see the Simulation Design and Evaluation Metrics subsection.]

Point 4: The statistical model underlying the p-curve is identical to the model of Hedges, 1984 [H1984]
Both the p-curve and H1984 are one-parameter selection models (1PSM) that make identical statistical assumptions: effect sizes are homogeneous across studies and only studies with results that are statistically significant are “published” (i.e., included in the meta-analysis). Stated another way, the statistical model underlying the two approaches is 100% identical, and hence if you accept the assumptions of the p-curve you therefore accept the assumptions of H1984, and vice versa.
The only difference between the two methods is how the single effect size parameter is estimated from the data:
H1984 uses principled maximum likelihood estimation (MLE) while the p-curve minimizes the Kolmogorov-Smirnov (KS) test statistic. Because MLE possesses a number of mathematical optimality properties, easily generalizes to more complicated models such as the 3PSM (as well as others even more complicated), and yields likelihood values, standard errors, and confidence intervals, it falls to SSN to mathematically justify why they view the proposed KS approach as superior to MLE for psychology data.(2)
[For more details in MBH2016, see the Early Selection Methods and p-methods subsections.]
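To make the model in Point 4 concrete, here is a minimal simulation sketch of the 1PSM fit by maximum likelihood in the spirit of H1984: effect sizes are assumed homogeneous, only statistically significant and directionally consistent studies are observed, and the single effect size parameter is estimated from a truncated normal likelihood. The simulated values, the approximate standard error formula, and the use of Python/scipy are illustrative choices, not code from MBH2016 or the SSN blogpost.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)

# Simulate a "published literature": true effect d = 0.2, n = 40 per group,
# keeping only studies that reach p < .05 in the predicted direction.
delta_true, n_per_group, n_published = 0.2, 40, 30
se = np.sqrt(2.0 / n_per_group)            # approximate standard error of d
crit = stats.norm.ppf(0.975) * se          # smallest d that reaches significance
d_obs = []
while len(d_obs) < n_published:
    d = rng.normal(delta_true, se)
    if d > crit:                           # the 1PSM selection rule
        d_obs.append(d)
d_obs = np.array(d_obs)

def neg_loglik(delta):
    # Truncated-normal log-likelihood: the density of each observed estimate,
    # renormalized by the probability that a study gets "published" at all.
    logdens = stats.norm.logpdf(d_obs, loc=delta, scale=se)
    log_pr_sig = stats.norm.logcdf((delta - crit) / se)
    return -(logdens - log_pr_sig).sum()

fit = optimize.minimize_scalar(neg_loglik, bounds=(-1.0, 1.0), method="bounded")
print(f"Naive mean of the published estimates: {d_obs.mean():.2f}")  # inflated by selection
print(f"1PSM maximum likelihood estimate:      {fit.x:.2f}")         # adjusted; true value is 0.2
```

Replacing the MLE step with minimization of the KS distance between the observed significant results and their model-implied distribution would yield the p-curve estimate of this same one-parameter model; and, as footnote (3) notes, once effect sizes are heterogeneous (τ > 0), no version of this model targets the population average effect size.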

Point 5: Simulations require empirical and mathematical grounding
For a simulation to be worthwhile (i.e., in the sense of leading to generalizable insight), the values of the simulation parameters chosen (e.g., effect sizes, sample sizes, number of studies, etc.) and the data-generating process must reflect reality reasonably well. Further still, there should ideally be mathematical justification of the results. Indeed, with sufficient mathematical justification a simulation is entirely unnecessary and can be used merely to illustrate results graphically.
The simulations in MBH2016 provide ample mathematical justification for the results based on: (i) the optimal efficiency properties of the maximum likelihood estimator (MLE; Simulation 1), (ii) the loss of efficiency resulting from discarding data (Simulation 2), and (iii) the bias which results from incorrectly assuming homogeneity as a consequence of Jensen’s Inequality (Simulation 3). We remain uncertain about the extent to which Cases 2-5 of the SSN simulations reflect reality and thus seek mathematical justification for the generalizability of the results. Nonetheless, they seem of value if viewed solely for the purpose of assessing the 3PSM model estimates when that model is misspecified.
[For more details in MBH2016, see the Simulation Evaluation section.]

Point 6: The 3PSM actually performs quite well in SSN’s simulation—even when misspecified.
Only in Case 1 of the SSN simulation is the 3PSM properly specified (and even this is not quite true as the 3PSM allows for heterogeneity but the simulation assumes homogeneity). SSN show that when the 3PSM is misspecified (Cases 2-5), its Type I error is far above the nominal α=0.05 level. We provide further results in the figures here.
• The blue bars in the left panel of Figure 1 reproduce the SSN result. We also add results for the 1PSM as estimated via KS (p-curve) and MLE (H1984). As can be seen, the Type I error of the 1PSM MLE remains calibrated at the nominal level. In the right panel, we plot estimation accuracy as measured by RMSE (i.e., the typical deviation of the estimated value from the true value). As can be seen, the 3PSM is vastly superior to the two 1PSM implementations in some cases and approximately equivalent to them in the remaining ones.
• In Figure 2, we change the effect size from zero to small (d=0.2); the 3PSM has much higher power and better estimation accuracy as compared to the two 1PSM implementations.
• In Figure 3, we return to zero effect size but add heterogeneity (τ=0.2). The 1PSM has uncalibrated Type I error for all cases while the 3PSM remains calibrated in Case 1; in terms of estimation accuracy, the 3PSM is vastly superior to the two 1PSM implementations in some cases and approximately equivalent to them in the remaining ones.(3)
• In Figure 4, we change the effect size from zero to small and add heterogeneity. The 3PSM generally has similar power and better estimation accuracy as compared to the two 1PSM implementations (indeed, only in Case 1 does the 1PSM have better power but this comes at the expense of highly inaccurate estimates). 

In sum, the 3PSM actually performs quite well compared to the two 1PSM implementations—particularly when the focus is on estimation accuracy as is proper; this is especially encouraging given that the 1PSM is correctly specified in all five cases of Figures 1-2 while the 3PSM is only correctly specified in Case 1 of the figures. Although these results favor the 3PSM relative to the two 1PSM implementations, we reiterate our view that selection methods—and indeed any adjustment techniques—should at best be used only for sensitivity analysis.


Footnotes
(1) The same authors have developed a distinct methodology also labelled p-curve that attempts to detect questionable research practices. This note does not comment on that methodology.
(2) Both MLE and KS are asymptotically consistent and thus asymptotically equivalent for the statistical model specified here. Consequently, any justification will likely hinge on small-sample properties, which can be mathematically intractable for this class of models. Justifications based on robustness to model specification are not germane here because, if a different specification were deemed more appropriate, the model would be re-specified accordingly and that model estimated.
(3) A careful reading of SNS2014 reveals that the p-curve is not meant to estimate the population average effect size. As shown here and in MBH2016, it cannot, nor can any 1PSM. This is important because we believe that heterogeneous effect sizes (i.e., τ > 0) are the norm in psychology research.


References
[H1984] Hedges, L. V. (1984). Estimation of effect size under nonrandom sampling: The effects of censoring studies yielding statistically insignificant mean differences. Journal of Educational and Behavioral Statistics, 9, 61–85.

[MBH2016] McShane, B. B., Böckenholt, U., & Hansen, K. T. (2016). Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science, 11(5), 730–749.

[SNS2014] Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). p-Curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9(6), 666–681.