Incurably Nuanced

Thursday, November 8, 2018

Would preregistration speed or slow progress in science? A debate with Richard Shiffrin.

Lately, I've been arguing that preregistering focal predictions and pre-analysis plans can provide useful tools that enable scientists to more effectively and efficiently calibrate their confidence in theory and study results, respectively, thereby facilitating progress in science. But not everyone agrees. Recently, I had the opportunity to debate the question of whether preregistration would speed or slow progress in science with Rich Shiffrin, a professor in the Department of Psychological and Brain Sciences at Indiana University who comes to these issues with a very smart, thoughtful, and different perspective from the one I hold. I wanted to share the discussion with you (with his permission), in case you'd like to join us in thinking through some of the complexities of these issues.

RS: Having run a number of colloquia and symposia on reproducibility, and running one at Psychonomics this fall (and having published in PNAS a paper addressing this issue), I read your editorial with some interest. Although not objecting to anyone who chooses to pre-register a study, I would not like to see it in general use, because I think it would slow progress in science. There are many reasons for my views, and they won’t fit in an email. But some are related to the theme of your editorial and the reply, and to the title of my upcoming symposia: “Should statistics govern the practice of science, or science govern the practice of statistics?” It is my feeling that the pre-supposition you make and Nosek also makes is in error concerning the way science works, and has always worked. For example, you list two views of pre-registration:

i) Have these data influenced my theoretical prediction? This question is relevant when researchers want to test existing theory: Rationally speaking, we should only adjust our confidence in a theory in response to evidence that was not itself used to construct the theoretical prediction in question (3). Preregistering theoretical predictions can help researchers distinguish clearly between using evidence to inform versus test theory (3, 5, 6).

ii) Have these data influenced my choice of statistical test (and/or other dataset-construction/analysis decisions)? This question is relevant when researchers want to know the type I error rate of statistical tests: Flexibility in researcher decisions can inflate the risk of false positives (7, 8). Preregistration of analysis plans can help researchers distinguish clearly between data-dependent analyses (which can be interesting but may have unknown type I error) and data-independent analyses (for which P values can be interpreted as diagnostic about the likelihood of a result; refs. 1 and 9).

I argue and continue to argue that science is almost always post-hoc: That science is driven by data and by hypotheses, by models and theories we derive once we see the data, and that progress is generally made when we see data we do not anticipate. What your points i) and ii) pre-suppose is that there is a problem with basing our inferences on the data after we see the data, and with developing our theories based on the data we see. But this is the way science operates. Of course scientists should admit that this is the case, rather than pretend after the fact in their publication that they had a theory in advance or that they anticipated observation of a new pattern of data. And of course doing science in this post-hoc fashion leads to selection effects, biases, distortions, and opens the possibility for various forms of abuse. Thus good judgment by scientists is essential. But if science is to progress, we have to endure and live with such problems. Scientists are and should be skeptical and do not and should not accept published results and theories as valid or ‘true’. Rather, reports with data and theory are pointers toward further research, and occasionally the additional research proves the first reports are important and move science forward.

AL: Hi Rich, Thanks for writing. I think the space between our positions is much smaller than you think. You quote me saying: "Preregistering theoretical predictions can help researchers distinguish clearly between using evidence to inform versus test theory" and "Preregistration of analysis plans can help researchers distinguish clearly between data-dependent analyses (which can be interesting but may have unknown type I error) and data-independent analyses (for which P values can be interpreted as diagnostic about the likelihood of a result)." And then you say: "Of course scientists should admit that [basing our inferences and theories on the data after we see it]...is the case, rather than pretend after the fact in their publication that they had a theory in advance or that they anticipated observation of a new pattern of data."

These positions are not in conflict. We are both saying that it's important for scientists to acknowledge when they are using data to inform theory, as well as when they are using data to inform analytic decisions. The first statement of mine that you quote says only that scientists should distinguish clearly between using data to inform versus test theory. I agree with you that it's extremely important and common and useful for scientists to use data to inform theory (and I wish I'd had the space to say that explicitly in the PNAS piece--I was limited to 500 words and tried to do as much as I could in a very tight space). The second statement of mine that you quote does explicitly acknowledge that data-dependent analyses can be interesting (and again, I wish I'd had the space to add that data-dependent analyses are often an important and necessary part of science).

So, I do not presuppose that "there is a problem with basing our inferences on the data after we see the data, and with developing our theories based on the data we see" -- I only presuppose that there is a problem with not clearly acknowledging when we do either or both of these things. The two elements of preregistration that I describe can help researchers (a) know and (b) transparently acknowledge when they are using data to inform their theoretical prediction and/or analytic decisions, respectively.

Alternatively, if you don't have the goal of testing theory and/or constraining Type I error, you can simply state that clearly and explicitly in a paper. No need to preregister as long as you're up front on this point. But IF you have the goal of testing a theoretical prediction, preregistering that prediction can be a useful tool to help achieve that goal. And IF you have the goal to constrain Type I error, a carefully specified pre-analysis plan can be a useful tool to help achieve that goal.

Your position seems to be that most scientists don't have and shouldn't have those goals. (Please tell me if that's not a fair characterization.) I think some scientists DO have those goals and that there are many contexts in which one or the other (or both) goals are appropriate and help to advance science. So we perhaps disagree about how common and important these goals are. But I think we are very much in agreement that using data to inform theory is crucial for advancing science, that it's very common, and that researchers should not pretend they are theory-testing when they are in fact theory-building.

RS: Thanks for responding. I guess we mostly agree, although space in the editorial did not allow that to be entirely clear.

However, you say:

“But IF you have the goal of testing a theoretical prediction, preregistering that prediction can be a useful tool to help achieve that goal. And IF you have the goal to constrain Type I error, a carefully specified pre-analysis plan can be a useful tool to help achieve that goal.”

I think we have somewhat different views here, and the difference may depend on one’s philosophy of science. The “IF” may be in question. This may be related to the title of several symposia I am running: “Should statistics govern the practice of science or science govern the practice of statistics?”

Scientists do run experiments to test theories. But that is superficial and the reality is more complex: We know all theories are wrong, and the real goal of ‘testing’ is to find ways to improve theory, which often happens by discovering new and unexpected results during the process of testing. This is seldom said up front or even after the fact. Of course not all scientists realize that this is the real goal…

Mostly this perspective applies to exploratory science. When one wants to apply scientific results, which happens in health and engineering contexts fairly often, then the goal shifts because the aim is often to discover which of two (wrong) theories will be best for the application. One can admit before or after the fact that this is the goal, but in most cases it is obvious that this is the case from the context. There are cases in social psychology where the goal is an application, but I think the great majority of research is more exploratory in character, with potential or actual applications a bit further in the future.

The goal of restraining Type 1 error? Here we get into a statistical and scientific domain that is too long for an email. Perhaps some people have such a goal, but I think this is a poor goal for many reasons, having to do with poor statistical practice, a failure to consider a balance of goals, the need for exploration, and perhaps most important, the need for publication of promising results many of which may not pan out, but a few of which drive science forward.

AL: I think you're exactly right about where our difference in opinion lies, and that it may depend on one's philosophy of science. And I often have the slightly uncomfortable feeling, when I'm having these kinds of debates, that we're to some extent just rehashing debates and discussions that philosophers of science have been having for decades.

Having said that, let me continue on blithely arguing with you in case it leads anywhere useful. :)

You say: "We know all theories are wrong, and the real goal of ‘testing’ is to find ways to improve theory, which often happens by discovering new and unexpected results during the process of testing." I agree with this characterization. But I think we can in many cases identify ways to improve theory more quickly when we are clear about the distinction between using results to inform vs. test theory, and when we are clear about the distinction between results that we can have a relatively high degree of confidence in (because they were based on data-independent analyses) and results that we should view more tentatively (because they were based on data-dependent analyses).

For example, if I am messy about the distinction between informing vs. testing theory, I might run a bunch of studies, fit my theory to the results, but then think of and report those results as "testing and supporting" the theory. That leads me to be too confident in my theory. When I see new evidence that challenges my theory, it may take me longer to realize my theory needs improvement because I have this erroneous idea that it has already received lots of support. I would be more nimble in improving my theory if I kept better track of when a study actually tests it and when a study helps me construct it.

Meanwhile, if I am messy about the distinction between data-independent and data-dependent analyses, I may infer that my study results are stronger than they actually are. If I think I have very strong evidence for a particular effect -- as opposed to suggestive but weaker evidence -- I will be too confident in this effect, and therefore slower to question it when conflicting evidence emerges.

To me, all of this boils down to the simple idea that preregistering predictions and analysis plans can -- if we do it carefully -- help us to more appropriately calibrate our confidence in our theories and our confidence in the strength of evidence provided by a given study. I do not think preregistration is useful in every context, but I do think it can be useful in many contexts and that it's worth considering as a tool to boost the informational value of our research.

RS: All science starts with data and then scientists invent theories to explain the data. A good scientist does not stop there but then explores, tests, and generalizes the new ideas. Sometimes the data comes from ‘expensive’ studies that are not easy to pursue, and then the good scientist will report the data and the proposed theory or hypotheses, but not pretend the ideas were known a priori.

AL: I definitely agree with this!

RS: Let me add a PS: I repeat that I would not object to anyone wanting to pre-register a study, but I don’t think this is the best way to accomplish the goals you profess, and I certainly would not want to see any publication outlet require pre-registration, and I certainly would not like to see pre-registration become the default model of scientific practice.

AL: Very clear -- thank you! Do you want me to link to your PNAS article for anyone who wants to read more about your position on these issues, or would there be a better place to direct them?

RS: Yes. By the way, if you have any reactions to the PNAS article and want to convey them to me, I’d be glad to hear them.

"Exploratory science is a matter of knowledge accretion, often over long periods of time. A metaphor would be a constantly growing tree, occasionally sprouting new and large branches, but one that continually modifies many of its previous branches and leaves, as new knowledge develops. Invalid and irreproducible reports are buds on the tree that have the potential to sprout but occur in positions that do not allow growth. The 'tree' does not know which positions foster growth, and therefore, it forms many buds, some failing and others sprouting." - Shiffrin, Borner, & Stigler, 2018,PNAS.

AL: I think it's brilliant and I especially like the tree metaphor and the point that the existence of many individual failures doesn't mean that science is failing. I think one place where we might differ is in how we see irreproducible reports—you describe them as buds on the tree that don't end up sprouting, but I think sometimes they sprout and grow and use up a lot of resources that would be better spent fueling growth elsewhere. I think this happens through a combination of (a) incentive structures that push scientists to portray more confidence in their theories and data than they should actually have and (b) publication bias that allows through data supporting an initial finding while filtering out data that would call it into question and allow us to nip the line of research in the metaphorical bud. In other words, I think scientists are often insufficiently skeptical about (a.k.a. too confident in) the buds on the tree. I see preregistration as a tool that can in many contexts help address this problem by enabling researchers to better calibrate their confidence in their own and others' data. (Part of the reason that I think this is that preregistration has helped my lab do this—we now work with a much more calibrated sense of which findings we can really trust and which are more tentative...we still explore and find all sorts of cool unexpected things, but we have a better sense of how confident or tentative we should be about any given finding.)

Meanwhile, I think you worry that preregistration will hamper the natural flourishing of the science tree (a totally reasonable concern) and you think that researchers should and perhaps already do view buds with appropriate levels of skepticism (I agree that they should view individual findings with skepticism but I don't think they typically do).

RS: I think the history of science shows science has always wasted and still wastes lots of resources exploring dead ends. The question is whether one can reduce this waste without slowing progress. No one knows, or knows how to carry out a controlled experiment to tell. What we do know is progress is now occurring extremely rapidly, so we best be hesitant to fix the system without knowing what might be the (unintended, perhaps) consequences.

I see no danger in very small steps to improve matters, but don’t want to see large changes, such as demands by journals for pre-registration.

Wednesday, October 31, 2018

Should we believe the p-value of a study result more when the finding fits with (vs. challenges) our expectations?

I keep seeing the same debate about preregistration. (There are several, but one in particular seems to repeat over and over again, at conferences and online, between preregistration advocates and skeptics.) It goes something like this:

Advocate: Preregistration is really important for science because preregistering a study makes the findings more trustworthy.

Skeptic: This is ridiculous! A finding is not more likely to be true just because you happened to correctly predict it ahead of time!

Advocate: Nobody is saying that.

Skeptic: Let’s say you and I both run the exact same study, but we make opposing predictions: I predict A, you predict B. My study shows A, as I expected. Your study shows A, contrary to your expectation. My study’s finding isn’t somehow truer than yours just because I happened to call it correctly ahead of time!

Advocate: NOBODY IS SAYING THAT!!!

I thought I saw, in this repeating debate, a simple but crucial miscommunication: Preregistration advocates were using the term “preregistration” to refer to pre-analysis plans, which constrain researcher degrees of freedom and can help ensure that p-values are interpretable as diagnostic about the likelihood of an outcome. But advocates would also sometimes talk about preregistrations as involving prediction, even though making a directional prediction isn’t necessary for constraining researcher degrees of freedom (the clearest illustration of this confusion is probably this Data Colada blog post in which the researchers who started AsPredicted point out that no prediction is necessary for preregistration, and that in retrospect they probably should have called their website AsPlanned).* And so skeptics would hear the term “preregistration” and think that it meant prediction, even though it often meant pre-analysis plan.

So I wrote this short piece to make what I thought was a minor but important point about clarifying our terminology. I framed it as a reply to one particular article that uses the term “prediction” a lot while advocating for pre-analysis plans, but I tried to emphasize that I was making a broader point about the language many of us use when promoting preregistration.

But I was in for a surprise: It turned out that what I thought was a simple miscommunication was in fact a deeper disagreement. It turns out that yes, some people are indeed saying that a finding is more likely to be true when you correctly predict it ahead of time.

I find this position unnerving because it’s hard for me to see where the line is between it and, say, a person who thinks vaccines cause autism deciding that they don’t believe the scientific evidence to the contrary because it challenges their personal beliefs or expectations. (Presumably, there is a line, but I have yet to see it clearly articulated.)

I had an opportunity to discuss this difference of opinion with two of the authors on the original PNAS paper linked above, Brian Nosek and Charlie Ebersole. The full email discussion is here. I’ll pull out some highlights:

Alison: …I think we might disagree here—or at least, I think we need to distinguish between beliefs or confidence in a THEORY versus beliefs or confidence in a study RESULT. I agree that we should update our confidence in a THEORY based on what it was able to predict ahead of time. I disagree that we should base our confidence in a RESULT on whether it was predicted ahead of time.

I think that in your example [here], the three researchers with different predictions should believe the study result equally based on the statistical evidence.…The strength of the EVIDENCE doesn't change depending on what was predicted ahead of time, but our beliefs about the THEORIES that gave rise to the predictions can and should.

What do you think?

Brian: …If by RESULT you mean knowing something about what the finding is, then there is definitely disagreement. If RESULT A is sequential priming is stronger when the prime precedes compared to follows the targets and RESULT B is sequential priming is stronger when the prime follows compared to precedes the targets, a p-value showing RESULT B is less believable than a p-value showing RESULT A. I don't need any information about the theories that anticipate A versus B to have priors about the results themselves and therefore update those priors based on the statistical evidence. …. priors can be very relevant with no theory. A mouse can have strong priors that pushing a lever produces a pellet and when the light produces the pellet instead, the mouse will not update the priors as much as one that did not have the prior experience. We don't need to assert that the mouse has any theory at all--priors can be based entirely on contingency expectations without any model or explanation for how those contingencies emerged.

Alison: …"How confident am I in my theory or [prior] belief?"…is separate from the question "How confident am I in this study result?" Because the amount/strength/quality of the evidence provided by a single study does not depend on the researcher's prediction. A study is not stronger if a researcher guesses or predicts the result ahead of time (which is what I think you imply when you equate "prediction science" with pre-analysis plans in your original PNAS paper). The quality of the evidence depends on things like whether there was a pre-analysis plan and whether construct validity and internal validity and external validity were high. If all those things are in place, the study might provide very strong evidence. Whether that strong evidence is sufficient to change a researcher's mind about a theory [or prior belief/expectation] may depend on the researcher's degree of confidence in the theory before the study was run. If they are very confident in the theory, then even strong evidence may not be enough to change their mind. But they can't call it weak evidence or a poorly conducted study just because the results turned out to be different from what they expected.

Charlie: …I agree with your concerns about the threats to falsifiability that come from differential interpretations of studies. In my earlier email, I was mainly trying to simulate the reaction that someone might have to learning the results of a preregistered study that goes way against their priors/theories/expectations....Someone who believes in ESP and I might be able to agree on how p-values work (although ESP does have some really interesting implications for the ability to construct data-independent analysis plans) but we are likely to not agree on how time works (where our theories disagree). Even if I agree that the p-value from their study is diagnostic and that their study is high quality (high validity and all that), I may still think it's more likely than not that their results represent a true false positive, and not reality, because it's so against my theory and prior beliefs. Again, I'm not saying that I'm being rational or fair in this situation, but it does represent the gap between believing in statistical results, judging the implications of a given result, and then revising theories/beliefs.

Alison: I agree that it's tempting (and human nature) to think: This new information contradicts my existing attitudes and beliefs (including my favorite theoretical predictions), and so I don't think it's as good quality as I would if the same kind of information supported my existing attitudes and beliefs. In fact, I have run studies on exactly this kind of motivated reasoning (e.g., participants believe a scientific study is better in quality when its conclusions support vs. contradict something they want to believe). But this kind of irrational reasoning is NOT good scientific reasoning, and it's arguably a big part of what landed us in our current mess. Scientists need to seek consensus about what constitutes good quality evidence independently from whether that evidence happens to support or oppose their preferred conclusions. Indeed, this is one of the major benefits of reviewed preregistrations (a.k.a. registered reports) like those at Cortex and CRSP and JESP and other journals: Reviewers evaluate the soundness of the study's methods before the results are known.

In the ESP example, I think what you need to say is: "It would take a LOT of high quality evidence to change my belief that people are unable to predict the future," rather than "evidence is high quality to the extent that it confirms my belief about people's ability to predict the future."
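
One way to make this distinction concrete is a small Bayesian sketch (offered only as an illustration; the discussion here is not otherwise framed in Bayesian terms). It treats the strength of a study's evidence as a Bayes factor that is the same for every reader, and treats each reader's prior as the only thing that differs. The priors, the Bayes factor of 10, and the function name `posterior_probability` are all hypothetical, and combining studies by multiplying Bayes factors assumes the studies are independent.

```python
def posterior_probability(prior, bayes_factor):
    """Update a prior probability that an effect is real, given a Bayes factor.

    The Bayes factor (evidence for the effect relative to the null) stands in
    for the strength of the study's evidence. It does not depend on the prior.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    study_bayes_factor = 10.0  # hypothetical: the same evidence for both readers

    skeptic_prior = 0.001  # hypothetical reader who strongly doubts the effect
    neutral_prior = 0.5    # hypothetical reader with no strong expectation

    # One study: identical evidence, very different posterior beliefs.
    print("skeptic after one study:", posterior_probability(skeptic_prior, study_bayes_factor))
    print("neutral after one study:", posterior_probability(neutral_prior, study_bayes_factor))

    # Three independent studies of the same strength: Bayes factors multiply.
    combined = study_bayes_factor ** 3
    print("skeptic after three studies:", posterior_probability(skeptic_prior, combined))
```

In this sketch the skeptic never relabels the study as weak. The evidence term is identical for both readers; what differs is how much evidence of that quality each one needs before changing their mind.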

Charlie: …I agree that that's not good scientific reasoning. I wasn't trying to display good scientific reasoning, merely just trying to highlight a situation where someone could have different views on 1) the diagnosticity of the statistical evidence (namely the p-value), 2) their interpretation of a study, and 3) their resulting shifts in beliefs/theories.

I do agree with your line: "It would take a LOT of high quality evidence to change my belief that people are unable to predict the future," rather than "evidence is high quality to the extent that it confirms my belief about people's ability to predict the future." Based on my priors, a true false positive (they will happen from time to time, even with preregistration) may seem more likely in a single instance (judging a single study) and thus be my interpretation of the study ("it wasn't a bad study, I think the results were a fluke"). Multiple observations of the effect would then make them being false positives less likely and would force me to confront my beliefs/theories….

[We ended our discussion of this point soon after, so that we could move on to debate our second point of disagreement, which I will turn to in the next post.]

*Unless you’re planning one of a small handful of statistical tests in a NHST framework that do care about the direction of your prediction, like a one-tailed t-test. And of course, Bayesian statistics provide a formal way of integrating a researcher’s prediction (or lack of prediction), their confidence in that prediction/prior belief, and their commitment to update their beliefs based on the strength of evidence observed in a given study. But we’re not talking about one-tailed tests or Bayesian statistics here.

Response from Charlie:

First, I want to thank Alison for engaging with us on this topic and for sharing a draft of this blog post before posting it. We had a very interesting conversation about these and other issues, and I’d encourage interested readers to read through the whole exchange. I’d also like to state upfront that I’m speaking just for myself here, as I was in our conversation. I don’t claim to know what Brian thinks (finishing my dissertation would be much easier if I did).

The main point Alison raises is that some folks (much to her surprise) think that findings are more likely to be true if they are predicted ahead of time. I’ll admit that I had some meta-surprise in response to this, as I have apparently missed the hypothetical argument that starts this blog post (I guess I’ve spent my time shouting about other aspects of preregistration [1]). However, that might be more a reflection of different people using terms to mean different things. I tried to explain my reasoning for this in the following paragraph, taken from our exchange:

“Jumping to your last email, Alison, I think I disagree with the statement that "A study is not stronger if a researcher guesses or predicts the result ahead of time" because pre-analysis plans, at least to me, specify a prediction and do provide stronger evidence when provided ahead of time. The variables in our models represent the conceptual relationships we are interested (in) and we can have more confidence in the inferences we make from those models if we've specified them data-independent. This feels to me a little bit like our bank shot metaphor in our response (referring to this: psyarxiv.com/a6k7h). The bank shot is the model that the shooter plans to use to make the basket. The inference that I'm trying to draw from this scenario is whether or not I think the shooter is good at basketball (or at least shooting in basketball?). Whether or not they call bank, I can certainly agree that they made that particular shot. If that's the only shot I care about, I don't much care whether they called it or not (no need to use inferences if you've sampled the entire population of interest). But if I care about drawing further conclusions about the shooter's ability, I will have greater confidence in them if I knew their planned model ahead of time. In that sense, it's better evidence because I've got a more accurate representation of what happened (or a more accurate representation of the relation between prediction and result)”

When we use inferential statistics, we’re trying to infer something about the broader world, broader populations, or future events from what we observed in the data. If a researcher surveyed the political views of 100 undergraduates and only wanted to draw conclusions about those 100 students at that one time, there’d be no need to calculate p-values – they’ve sampled their entire population of interest. However, that’s not the kind of question we typically ask in research (and certainly not how we write our discussion sections). P-values give us a way of thinking about the likelihood of a result given a particular null hypothesis and are a tool we use to judge the likelihood of a finding. P-values also lose their diagnosticity if they come from data-dependent analyses, which isn’t a worry if you’ve called your shot (or model) ahead of time.

So yes, I think a finding is more likely to be true if it is predicted ahead of time. Our predictions are manifested in our statistical models and the results from those models inform our confidence in a finding. As long as we’re using p-values as a way to calibrate that confidence, it’s important to know if we’ve called our shots or not.

[1] Such as “should there be a hyphen in preregistration?” Answer: No.

Monday, October 22, 2018

WHY You Preregister Should Inform the WAY You Preregister

Sometime in the 2012-2014 range, as the reproducibility crisis was heating up in my particular corner of science, I fell in love with a new (for me) approach to research. I fell in love with the research question.

Before this time, most of my research had involved making directional predictions that were at least loosely based on an existing theoretical perspective, and the study was designed to confirm the prediction. I didn’t give much thought to whether, if the study did not come out as expected, it would help to falsify the prediction and associated theory. (The answer was usually no: When the study didn’t come out as expected, I questioned the quality of the study rather than the prediction. In contrast, when the study “worked,” I took it as evidence in support of the prediction. Kids at home: Don’t do this: It’s called motivated reasoning, and it’s very human but not particularly objective or useful for advancing science.)

But at some point, I began to realize that science, for me, was much more fun when I asked a question that would be interesting regardless of how the results turned out, rather than making a single directional prediction that would only be interesting if the results confirmed it. So my lab started asking questions. We would think through logical reasons that one might expect Pattern A vs. Pattern B vs. Pattern C, and then we would design a study to see which pattern occurred.* And, whenever we were familiar enough with the research paradigm and context to anticipate the likely researcher degrees of freedom we would encounter when analyzing our data, we would carefully specify a pre-analysis plan so that we could have high confidence in results that we knew were based on data-independent analyses.

One day, as my student was writing up some of these studies for the first time, we came across a puzzle. Should we describe the studies as “exploratory,” because we hadn’t made a clear directional prediction ahead of time? Or as “confirmatory,” because our analyses were all data-independent (that is, we had exactly followed a carefully specified pre-analysis plan when analyzing the results)?

This small puzzle became a bigger puzzle as I read more work, new and old, about prediction, falsification, preregistration, HARKing, and the distinction between exploratory and confirmatory research. It became increasingly clear to me that it is often useful to distinguish between the goal of theory falsification and the goal of Type I error control, and to be clear about exactly what kinds of tools can help achieve each of those goals.

I wrote a short piece about this for PNAS. Here’s what it boils down to:

1. If you have the goal of testing a theory, it can be very useful to preregister directional predictions derived from that theory. In order to say that your study tested a theory, you must be willing to upgrade or downgrade your confidence in the theory in response to the results. If you would trust the evidence if it supported your theory but question it if it contradicted your theory, then it’s not a real test. Put differently, it’s not fair to say that a study provides support for your theory if you wouldn’t have been willing to say (if the results were different) that the same study contradicted your theory.**

2. If you have the goal of avoiding unintentional Type I error inflation so that you can place higher confidence in the results of your study, it can be very useful to preregister an analysis plan in which you carefully specify the various researcher decisions (or “researcher dfs”) that you will encounter as you construct your dataset and analyze your results. If your analyses are data-independent and if you account for multiple testing, you can take your p-values as diagnostic about the likelihood of your results.***
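To make the second point concrete, here is a minimal simulation sketch. It is an illustration, not something from the PNAS letter. It assumes a two-group study with no true effect, several independent outcome measures, and a simple z-test with a known SD of 1; the function names and defaults are hypothetical. It estimates how often at least one outcome comes out "significant" when the researcher is free to report whichever one worked.

```python
import random
from statistics import NormalDist, mean


def z_test_p(group_a, group_b):
    """Two-sided p-value for a difference in means.

    Assumes equal group sizes and a known population SD of 1
    (a simplification chosen to keep the sketch dependency-free).
    """
    n = len(group_a)
    z = (mean(group_a) - mean(group_b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_outcomes, alpha=0.05, n_per_group=20,
                        n_sims=5000, seed=1):
    """Share of null studies in which at least one outcome has p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: no true effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(z_test_p(a, b))
        # The "flexible" researcher reports the smallest p-value.
        if min(p_values) < alpha:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    print("one planned outcome:", false_positive_rate(1))
    print("best of three outcomes:", false_positive_rate(3))
    print("three outcomes, Bonferroni:", false_positive_rate(3, alpha=0.05 / 3))
```

Under these assumptions, theory says the "best of three" rate should sit near 1 - 0.95^3, or roughly 14%, rather than the nominal 5%, and dividing alpha by the number of tests should pull it back to 5% or below. Real outcome measures are usually correlated, so the inflation in practice would typically be somewhat smaller than in this independent-outcomes sketch.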

Why do I think this distinction is so important? Because thinking clearly about WHY you are preregistering (a) helps ensure that the way you preregister actually helps achieve your goals and (b) answers a lot of questions that may arise along the way.

Here’s an example of (a): If you want to test a theory, you need to be ready to update your confidence in the theory in EITHER direction, depending on the results of the study. If you can’t specify ahead of time (even at the conceptual level) a pattern of evidence that would reduce your confidence in the theory, the study is not providing a true test...and you don’t get to say that evidence consistent with the prediction provides support for the theory. (For instance: If a study is designed to test the theoretical prediction that expressing prejudice will increase self-esteem, you must be willing to accept the results as evidence against the theory if you find that expressing prejudice decreases self-esteem.)

Here’s an example of (b): You might preregister before running a study and then find unexpected results. If they aren’t very interesting to you, do you need to find a way to publicize them? The answer depends on why you preregistered in the first place. If you had the goal of combatting publication bias and/or theory testing, the answer is definitely YES. But if your goal was solely to constrain your Type I error rate, you’re done—deciding not to publish obviously won’t increase your risk of reporting a false positive.

Read the (very short) PNAS letter here. You can also read a reply from Nosek et al. here, and two longer discussions (Part I and Part II) that I had with the authors about where we agree and disagree, which I will unpack more in subsequent posts.

---

*The best part of this approach? We were no longer motivated to find a particular pattern of results, and could approach the design and analysis of each study in a much more open-minded way. We could take our ego out of it. We get to yell "THAT'S FASCINATING!" when we analyze our data, regardless of the results.

**In philosophy of science terms, tests of theoretical predictions must involve some risk. The risk in question derives from the fact that the theoretical prediction must be “incompatible with certain possible results of observation” (Popper, 1962, p. 36)—it must be possible for the test to result in a finding that is inconsistent with the theoretical prediction.

***De Groot (2014) has a great discussion of this point.

Monday, October 15, 2018

You say potato, I say: I've been living in a downpour of hailing potatoes and I'm beyond exhausted, so could you please help me hold this umbrella?

Let’s talk about sexism and science. Not the science of sexism, but sexism in science, and what does “sexism” really mean, and what happens when well-intentioned people who all prioritize diversity and inclusion find themselves on opposite sides of an uncomfortable conversation.

Context is important for understanding the impact of any single event, so I’m going to start there, with one context for each side.

Context #1: Once upon a time, a program committee for a conference with arguably an “old-boys network” reputation wanted to improve the diversity of its conference speakers, and so put a ton of effort and time into creating a conference program that was remarkably diverse in many ways and represented a variety of perspectives. For example, in the entire conference program, there was only a single manel, which, if you’ve been following gender imbalances in who gets asked to share their expertise in the academy and beyond, is impressively low. Meanwhile, there were four symposia with all female speakers, which is impressively high.

Context #2: Once upon a time, world history was what it is and sexism (and racism and homophobia and all the rest) were longstanding, entrenched, systemic problems that members of groups disadvantaged by the existing social hierarchy had to confront every day in a dozen small ways and sometimes also very big ways. The Kavanaugh hearings had brought to the fore just how little women’s voices matter in our society. And many women were feeling exhausted…exhausted by thinking almost constantly about the Kavanaugh hearing, exhausted about reliving past experiences of rape and sexual assault and men not taking “no” for an answer. Exhausted by every single tiny and large barb that we navigate on a daily basis, not just in our personal lives but also as professors. The student who asks for our advice on doing better in another class with a male professor because they “don’t want to bother him, he’s such a real professor you know and I wouldn’t want to waste his time,” the committee meeting where male professors introduce themselves to each other but not to us because they assume we’re the secretary there to take notes, the news articles that forget to interview a single woman scientist, the wildly disproportionate service expectations.

In these two contexts, the manel at the conference (the only manel at the conference, and simultaneously one of many manels in a manel-filled world) became the subject of a conversation. And I walked into the conversation and thought to myself, ugh, I am so tired of manels in the world, and I said something like: “those speakers are all smart, lovely, excellent scholars. but blech. i wish dudes would start refusing to be on panels that had such uneven representation. (although i can imagine you sometimes wouldn’t know until you accepted the invite...and yet how cool would it be if white dudes started politely bowing out of all-male panels? ‘i’m so sorry, i just realized that this will be an all-male panel! you don’t need me on here too...i’d suggest asking Dr. X, a super duper expert on this topic, instead!’)”

Now I’m going to move away from the complexities of this specific conversation and just talk about my own experience of it and what I learned. My experience of this conversation was that I intended to express a sentiment about the world (“Sexism in the world sucks! I’d love to see men taking opportunities to push back against it, when they arise!”), and some of my friends who are men heard a sentiment about individuals (“You are sexist!”).

I think that this experience helps highlight a theme that might be useful to all of us when we have conversations about these kinds of issues. I'm going to borrow from Ijeoma Oluo's kickass book to make this point, pulling out a distinction she makes about race and applying it to gender. (Her book is awesome and worth reading, if you haven’t read it yet. Click here to fix that).

Here's the distinction: We can think about sexism at an individual level (e.g., when a senior male professor asks if I can write up my notes on a committee meeting and send them to everyone, is that sexist or something he'd ask anyone? Did a particular panel end up being all men because of individual sexism or did the organizers try hard to find women speakers?). But we can also think about sexism at the systemic level (e.g., is there a broad pattern of men disproportionately asking women to take on invisible and secretarial kinds of service work? Is there a broad pattern of women scholars being relatively overlooked when it comes to invited symposia, speakers, journal articles, etc.?). I suspect that often, when a woman points out an example of something that feels sexist to her, she is trying to point out SYSTEMIC sexism. For example, I was running on a crowded narrow bridge the other day, and a man bicycled through very quickly, and a woman commented: “Of course the men speed through without stopping.” She didn’t mean this (I suspect) as any kind of comment about that particular man or his intentions in that moment. She was pointing out the impact of a broader pattern. She was venting frustration about how time after time after time after time she steps politely to the side while various men in various contexts barrel through.

I’m going to go a step further and speculate that often, when a woman points out an example of something that feels sexist to her, men may hear her comment as being focused on INDIVIDUAL sexism. For example, if the bicyclist overheard the comment about men speeding through, he might have thought to himself “For goodness sakes! That is so unfair! I just didn’t SEE the people on the bridge in time because I was trying to steer around the dog, and I am like the most feminist guy in the world and WTF is up with this random woman accusing me of being sexist!?! I’m one of the good guys!”

Similarly, when I made my comment above expressing frustration with all-male panels and wishing that more white guys would start refusing to be on them, I was expressing a sentiment about the systemic version of this issue and one way to address it. I actually wasn’t thinking much about this specific panel, because one specific panel isn’t a problem—only the broader pattern is a problem—and because the particular people on this particular panel were all awesome social justice warriors in various ways and so it really wouldn’t occur to me in a million years to call them sexist at an individual level.

If I’m right—if women often see and try to call out systemic sexism, but men often hear (and therefore quite reasonably feel defensive about) claims of individual sexism—then the question becomes how to have productive and respectful conversations about sexism without being derailed by miscommunication.

One answer is that it is up to both sides to work harder to clarify what they mean: Women need to make the extra effort to clarify when they are talking about a broad pattern and not any one particular individual or instance, and to recognize the individual efforts that the individual sitting across from them in the conversation may have put into fighting discrimination and improving inclusion already. And men need to make the extra effort to notice when they are feeling defensive, let those feelings go, recognize that the conversation might be more productive if they can hear it as being about a broad pattern rather than any particular individual behavior, and reflect on how they might do more to help combat the broad pattern in the future.

But I would like to be provocative here and suggest a different answer. I would like to challenge the men reading this to make more than 50% of the effort in conversations like these. I would like to challenge you to make 60% or 80% or even 100% of the effort when necessary. Why? Because you have privilege in this context. And that privilege comes with secret stashes of extra bandwidth. Bandwidth that you didn’t spend nodding and smiling at the student who implied you weren’t a real professor. Bandwidth that you didn’t spend noticing that another newspaper article didn’t interview a single scientist of your gender. Bandwidth that you didn’t spend reserving a room or finding an article when a senior male colleague explained that he just didn’t understand how to deal with the online system and could you please just do it for him. Bandwidth that you didn’t spend editing that email for the fourth time to make sure it wouldn’t come across as too combative. Bandwidth that you didn’t spend pouring hours into conversation after conversation where you try to very patiently explain your lived experience to men who don’t understand or fully try to understand it in various ways. Bandwidth that you could, if you so choose, spend in listening when women take the time and energy to point out systemic sexism and respond without defensiveness, with a simple “I am listening.” Bandwidth that you could use to amplify women’s voices with your own voice that is louder and further reaching in this context.

Meanwhile, white women reading this: I have a challenge for us, as well. We have privilege in so many other contexts. And that privilege comes with secret stashes of extra bandwidth. And the next time we’re sitting across from someone, virtually or in person, who says “here is my lived experience as a person of color”—let’s just sit there and LISTEN. Even if we want to say “but I’m one of the good ones!” Even if we are tempted to explain something about our own perspective in response. Instead, let’s remember that screamy feeling in our chest, the exhaustion in our heads and hearts these days. Let’s tap into the secret bandwidth stash we bring to conversations about race, and let’s respond with “I hear you. I am listening.” And let’s amplify other people’s voices with our own that are louder and further reaching in this context.

Monday, March 5, 2018

Preregistration is a Hot Mess. And You Should Probably Do It.

There’s this episode of Family Guy where some salesman offers Peter a choice between two prizes: a boat or a mystery box. Peter can’t help himself—he reaches for the mystery box. “A boat's a boat, but a mystery box could be ANYTHING!” he says. “It could even be a boat!”

Mystery boxes are like that. Full of mystery and promise. We imagine they could be anything. We imagine them as their best selves. We imagine that they will Make Life Better, or in science, that they will Make Science Better.

Lately, I’ve been thinking a lot about what’s inside the mystery box of preregistration.

Why do I think of preregistration as a mystery box? I’ll talk about the first reason now, and come back to the second in a later post. Reason the First is that we don’t know what preregistration is or why we’re doing it. Or rather, we disagree as a field about what the term means and what the goals are, and yet we don’t even know there’s disagreement. Which means that when we talk about preregistration, we’re often talking past each other.

Here are three examples that I gave in my talk last week at SPSP.

Back in the olden days, in June 2013, Chris Chambers and colleagues published an open letter in the Guardian entitled “Trust in Science Would be Improved by Preregistration.” They wrote: “By basing editorial decisions on the question and method, rather than the results, pre-registration overcomes the publication bias that blocks negative findings from the literature.” In other words, they defined preregistration as peer-review and acceptance of articles before the results are known, and they saw the major goal or benefit of preregistration as combatting publication bias.

Fast forward to December 2017: Leif Nelson and colleagues publish a blog post on “How to Properly Preregister a Study.” They define preregistration as “time-stamped documents in which researchers specify exactly how they plan to collect their data and to conduct their key confirmatory analyses,” and they explain that the goal is “to make it easy to distinguish between planned, confirmatory analyses…and unplanned exploratory analyses.”

Then just a few weeks ago, out of curiosity, I posted the following question on PsychMAP: A researcher tweets: "A new preregistered study shows that X affects Y." What would you typically infer about this study? By far the most popular answer (currently at 114 votes, compared to the next most popular of 44) was that the main hypothesis was probably a priori (no HARKing). In other words, many psychologists seem to think preregistration means writing down your main hypothesis ahead of time, and perhaps that a primary goal is to avoid HARKing.

So if I tell you that I preregistered my study, what do I mean? And why did I do it—what was my goal?

I think we are in desperate need of a shared and precise language to talk about preregistration. Just like any other scientific construct, we’re not going to make much headway on understanding it or leveraging it if we don’t have a consensual definition of what we are talking about.

It seems to me that there are (at least) four types of preregistration, and that each one serves its own potential function or goal. These types of preregistration are not mutually exclusive, but doing any one of them doesn’t necessarily mean you’re doing the others.

Notice that I said POTENTIAL benefit or function. That’s crucial. It means that depending on how you actually implement your preregistration, you may or may not achieve the ostensible goal of that type of preregistration. Given how important these goals are for scientific progress, we need to be paying attention to how and when preregistration can be an effective tool for achieving them.

Let’s say you want to combat publication bias, so you preregister your study on AsPredicted or OSF. Who is going to be looking for your study in the future? Will they be able to find it, or might the keywords you’re using be different from the ones they would search for? Will they be able to easily find and interpret your results? Will you link to your data? If so, will a typical researcher be able to understand your data file at a glance, or did you forget to change the labels from, say, the less-than-transparent VAR1043 and 3REGS24_2?

Let’s say you have a theory that’s ready to be tested.* So you record your hypothesis ahead of time: “Stereotyping will increase with age.” But what kind of stereotypes? Measured how? The vagueness of the prediction leaves too much wiggle room for HARKing later on—and here I mean HARKing in the sense of retroactively fitting the theory to whatever results you happen to find. If you find that racial stereotypes get stronger but elderly stereotypes get weaker as people age, your vague a priori prediction leaves room for your motivated human mind to think to itself “well of course, the prediction doesn’t really make sense for stereotypes of the elderly, since the participants are themselves elderly.” Obviously, adjusting your theory in light of the data is fine if you’re theory building (e.g., asking “Do various stereotypes increase with age, and if so, when?”), but you wouldn’t want to do it and then claim that you were testing a theory-derived hypothesis in a way that allowed for the theory to be falsified. [Update: As Sanjay Srivastava pointed out in a PsychMAP discussion about this post, it's important to recognize that often, researchers wishing to provide a strong and compelling test of a theory will want to conduct a preregistration that combines elements 2 and 3—that is, specific, directional predictions about particular variables that are clearly derived from theory PLUS clear a priori constraints on researcher degrees of freedom.]

Or let’s say that you want to be able to take your p-value as an indicator of the strength of evidence for your effect, in a de Grootian sense, and so you preregister a pre-analysis plan. If you write down “participants have to be paying attention,” it doesn’t clearly constrain flexibility in data analysis because there are multiple ways to decide whether participants were paying attention (e.g., passing a particular attention check vs. how long participants spent completing the survey vs. whether a participant clicked the same number for every item on a survey). If you want to avoid unintentional researcher degrees of freedom (or “HARKing” in the sense of trying out different researcher decisions and selecting only the ones that produce the desired results), you need to clearly and completely specify all possible researcher decisions in a data-independent way.**
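One way to picture the difference between "participants have to be paying attention" and a fully specified rule is to imagine writing the rule as code before any data exist. The sketch below is purely illustrative: the class names, field names, and the 120-second cutoff are hypothetical placeholders, not recommendations, and a real pre-analysis plan would cover many more decisions than this one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExclusionRule:
    """Hypothetical attention criteria, fixed before any data are seen."""
    must_pass_attention_check: bool = True
    min_seconds: float = 120.0          # placeholder cutoff, chosen a priori
    drop_straightliners: bool = True    # same answer to every item


@dataclass
class Participant:
    pid: str
    passed_attention_check: bool
    seconds_on_survey: float
    responses: list


def is_excluded(p, rule):
    """Apply the preregistered rule exactly as written, with no judgment calls."""
    if rule.must_pass_attention_check and not p.passed_attention_check:
        return True
    if p.seconds_on_survey < rule.min_seconds:
        return True
    if rule.drop_straightliners and len(set(p.responses)) == 1:
        return True
    return False


if __name__ == "__main__":
    rule = ExclusionRule()
    sample = [
        Participant("p1", True, 300.0, [1, 4, 2, 5]),
        Participant("p2", False, 300.0, [1, 4, 2, 5]),
        Participant("p3", True, 45.0, [1, 4, 2, 5]),
        Participant("p4", True, 300.0, [3, 3, 3, 3]),
    ]
    kept = [p.pid for p in sample if not is_excluded(p, rule)]
    print(kept)
```

The point of freezing the rule is that all three ways of defining "paying attention" are pinned down in advance, so there is nothing left to choose among after the results are in.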

In fact, the registered report is really the only kind of preregistration on here that’s straightforward to implement in an effective way, because much of the structure of how to do it well has been worked out by the journal and because the pieces of the preregistration are being peer reviewed.

Which brings me to the second reason why I call preregistration a mystery box: What’s inside a preregistration when it HASN’T been part of a peer reviewed registered report? Maybe not what you would assume. Stay tuned.

---

*Many of us don’t do a lot of theory testing, by the way—we might be collecting data to help build theory, or asking a question that’s not very theoretical but might later end up connecting speculatively to some theories in potentially interesting ways (the sort of thing you do in a General Discussion)—but we’re not working with a theory that generates specific, testable predictions yet.

**Yeah, so, we’ve been using “HARKing” to mean two things…sometimes we use it to mean changing the theory to fit the results, which hampers falsifiability, and sometimes we use it to mean reporting a result as if it’s the only one that was tested, which hampers our ability to distinguish between exploratory and confirmatory analyses. (In his 1998 article, Kerr actually talked about multiple variants of HARKing and multiple problems with it.)***

***We’ve also been using “exploratory/confirmatory” to distinguish between both exploratory vs. confirmatory PREDICTIONS (do you have a research question or a directional prediction?) and exploratory vs. confirmatory ANALYSES (are your analyses data-dependent or data-independent/selected before seeing the data).****

****Did I mention that our terminology is a hot mess?

Tuesday, January 16, 2018

You're Running out of Time

Maybe you’ve been meaning to get to it. Maybe you keep thinking, I just don’t have time right now, but soon, as soon as I submit this paper, as soon as I finish teaching this class. Maybe you’re waiting for it to blow over. Maybe it feels like a choice between work and rubbernecking to watch some kind of field-wide car crash, and you’ve been choosing to work.

Or maybe it seems like a social psychology problem, and you’re not in that area, so it doesn’t even apply to you. In any case, business as usual. Onward and upward, publish or perish, keep on moving, nothing to see here.

Here’s the problem, though.

You’re running out of time.

This pesky “crisis” thing? It isn’t going away.* It isn’t limited to one area of psychology, or even just psychology. It’s not something you can ignore and let other people deal with. And it isn’t even something you can put off grappling with in your own work for just another month, semester, year, two years. The alarm bells have been sounded—alarm bells about replicability, power, and publication bias—and although these concerns have been raised before and repeatedly, a plurality of scholars across scientific disciplines are finally listening and responding in a serious way.

Now, it takes time to change your research practices. You have to go out and learn about the problems and proposed solutions, you have to identify which solutions make sense for your own particular research context, you have to learn new skills and create new lab policies and procedures. You have to think carefully about things like power (no, running a post-hoc power analysis to calculate observed power is not a good idea) and preregistration (like why do you want to preregister and which type of preregistration will help you accomplish your goals?), and you probably have to engage in some trial and error before you figure out the most effective approaches for your lab.
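For readers who have not yet done an a priori power calculation, here is a rough sketch of what one can look like for a simple two-group comparison. It uses a normal approximation, which gives slightly smaller sample sizes than a t-test-based calculation would, and the function names are hypothetical. Treat it as a starting point for thinking, not a substitute for a proper power analysis tailored to your design.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


def approx_power(effect_size_d, n, alpha=0.05):
    """Approximate power with n participants per group (same approximation)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size_d * sqrt(n / 2)
    # Probability of landing in either rejection region under the alternative.
    return _STD_NORMAL.cdf(shift - z_alpha) + _STD_NORMAL.cdf(-shift - z_alpha)


if __name__ == "__main__":
    print("n per group for d = 0.5:", n_per_group(0.5))
    print("n per group for d = 0.2:", n_per_group(0.2))
    print("power with 20 per group, d = 0.5:", round(approx_power(0.5, 20), 2))
```

The effect size has to come from somewhere other than the data you are about to analyze (prior literature, a smallest effect of interest), which is exactly why computing "observed power" after the fact does not tell you anything new.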

So a few years ago, when someone griped to me about seeing a researcher present a conference talk with no error bars in the graphs, I nodded sympathetically but also expressed my sense that castigating the researcher in question was premature. Things take a while to percolate through the system. Not everybody hears about this stuff right away. It might take people a while to go back through every talk and find every dataset and add error bars. Let’s have some patience. Let’s wait for things to percolate. Let’s give people a chance to learn, and try new things, and improve their research practices going forward, and let’s give that research time to make its slow way through the publication process and emerge into our journals.

Now, though? It’s 2018. And you’re submitting a manuscript where you interpret p = .12 on one page as “a similar trend emerged” that is consistent with your hypothesis, and on another page you use another p = .12 to conclude that “there were no differences across subsamples, so we do not investigate this variable further”…or you’re writing up a study where you draw strong conclusions from the lack of a significant difference on a behavioral outcome between 5 year olds and 7 year olds, with a grand total of 31 children per group and no discussion of the limited reliability of your measure?

Or you’re giving a talk…a new talk, about new data…and you haven’t put error bars on your graphs? And for your between-subjects interaction…for a pretty subtle effect …you collected TWENTY people per cell? And you don’t talk about power at all when you’re describing the study? Or the next study? Or the next?

Well now you’ve lost me. I’m looking out the window. I’m wondering why I’m here. Or actually, I’m wondering why YOU’RE here. Why are you here?

Are you here to science?

Well then. It’s time to pay attention.

Here is one good place to start.**

*Note, I'm not here to debate how bad the replicability crisis is. Lots of other people seem to find value in doing that, but I'm personally more interested in starting with a premise we can all agree on -- i.e., that there's always room for improvement -- and making progress on those improvements.

**And let me just emphasize that word start. I'm not saying you're out of time to finish making all improvements to your research methods and practices -- in fact, I see improving methods and practices as a process that we can incorporate into our ongoing research life, not something that really gets finished. Again, nothing is ever perfect...we can always be looking for the next step. But I do think it's time for EVERYONE to be looking for, and implementing, whatever that next step is in their own particular research context. If you find that you're still on the sidelines -- get in the game. This is not something to watch and it's not something to ignore. It's something you need to be actively engaged in.