I
keep seeing the same debate about preregistration. (There are several, but one
in particular seems to repeat over and over again, at conferences and online,
between preregistration advocates and skeptics.) It goes something like this:
Advocate: Preregistration is really important for science because preregistering
a study makes the findings more trustworthy.
Skeptic: This is ridiculous! A finding is not more likely to be true
just because you happened to correctly predict it ahead of time!
Advocate: Nobody is saying that.
Skeptic: Let’s say you and I both run the exact same study, but we make
opposing predictions: I predict A, you predict B. My study shows A, as I
expected. Your study shows A, contrary to your expectation. My study’s finding
isn’t somehow truer than yours just because I happened to call it correctly
ahead of time!
Advocate: NOBODY IS SAYING THAT!!!
I thought I saw, in this repeating debate, a simple but crucial
miscommunication: Preregistration advocates were using the term “preregistration”
to refer to pre-analysis plans, which constrain researcher degrees of
freedom and can help ensure that p-values are interpretable as diagnostic about
the likelihood of an outcome. But advocates would also sometimes talk about
preregistrations as involving prediction, even though making a directional
prediction isn’t necessary for constraining researcher degrees of freedom (the
clearest illustration of this confusion is probably this Data Colada blog post in which the
researchers who started AsPredicted point out that no prediction is necessary
for preregistration, and that in retrospect they probably should have called
their website AsPlanned).* And so skeptics would hear the term “preregistration”
and think that it meant prediction,
even though it often meant pre-analysis
plan.
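To make the pre-analysis-plan point concrete, here is a toy simulation, entirely my own illustration with made-up numbers (five outcome measures, thirty participants per group), not something drawn from any of the posts or papers mentioned here. When there is no true effect, testing the one prespecified outcome produces false positives at roughly the nominal 5% rate, while reporting whichever of several outcomes happened to "work" after looking at the data inflates that rate considerably. Notice that no directional prediction is involved; what matters is that the analysis was fixed before the data were seen.

```python
# Toy simulation (illustrative only): researcher degrees of freedom inflate
# false positives even when the null hypothesis is true. The numbers
# (5 outcomes, n = 30 per group, 5000 runs) are arbitrary assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, n_outcomes = 5000, 30, 5

prespecified_hits = 0  # significant results for the one planned outcome
flexible_hits = 0      # significant results for "whichever outcome worked"
for _ in range(n_sims):
    # Two groups with NO true difference, measured on five outcome variables.
    control = rng.normal(size=(n_per_group, n_outcomes))
    treatment = rng.normal(size=(n_per_group, n_outcomes))
    pvals = [stats.ttest_ind(treatment[:, j], control[:, j]).pvalue
             for j in range(n_outcomes)]
    prespecified_hits += pvals[0] < 0.05  # the outcome named in the plan
    flexible_hits += min(pvals) < 0.05    # the best-looking outcome, post hoc

print(f"False-positive rate, prespecified outcome: {prespecified_hits / n_sims:.3f}")
print(f"False-positive rate, best of {n_outcomes} outcomes: {flexible_hits / n_sims:.3f}")
# Typically something like 0.05 versus 0.20 or more.
```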
So I wrote this
short piece to make what I thought was a minor but important point about clarifying
our terminology. I framed it as a reply to one particular
article that uses the term “prediction” a lot while advocating for
pre-analysis plans, but I tried to emphasize that I was making a broader point
about the language many of us use when promoting preregistration.
But I was in for a surprise: What I thought was a simple miscommunication was in fact a deeper disagreement. It turns out that yes, some people are indeed saying that a finding is more likely to be true when you correctly predict it ahead of time.
I find this position unnerving because it’s hard for me to see where
the line is between it and, say, a person who thinks vaccines cause autism deciding that they don’t believe the scientific evidence to the contrary because it challenges
their personal beliefs or expectations. (Presumably, there is a line, but I
have yet to see it clearly articulated.)
I had an opportunity to discuss this difference of opinion with two of
the authors on the original PNAS paper linked above, Brian Nosek and Charlie
Ebersole. The full email discussion is here. I’ll pull out some highlights:
Alison: …I think we might
disagree here—or at least, I think we need to distinguish between beliefs or
confidence in a THEORY versus beliefs or confidence in a study RESULT. I agree that we should update our
confidence in a THEORY based on what it was able to predict ahead of time. I
disagree that we should base our confidence in a RESULT on whether it was
predicted ahead of time.
I
think that in your example [here], the
three researchers with different predictions should believe the study
result equally based on the statistical evidence.…The strength
of the EVIDENCE doesn't change depending on what was predicted ahead of time,
but our beliefs about the THEORIES that gave rise to the predictions can and
should.
What
do you think?
Brian: …If by RESULT you mean
knowing something about what the finding is, then there is definitely
disagreement. If RESULT A is "sequential priming is stronger when the prime precedes compared to follows the targets" and RESULT B is "sequential priming is stronger when the prime follows compared to precedes the targets," a p-value showing RESULT B is less believable than a p-value showing RESULT A. I don't need any information about
the theories that anticipate A versus B to have priors about the results
themselves and therefore update those priors based on the statistical evidence.
…. priors can be very
relevant with no theory. A mouse can have strong priors that pushing a
lever produces a pellet and when the light produces the pellet instead, the
mouse will not update the priors as much as one that did not have the prior
experience. We don't need to assert that the mouse has any theory at
all--priors can be based entirely on contingency expectations without any model
or explanation for how those contingencies emerged.
Alison: …"How
confident am I in my theory or [prior] belief?"…is separate from the
question "How confident am I in this study result?" Because the
amount/strength/quality of the evidence provided by a single study does not
depend on the researcher's prediction. A study is not stronger if a researcher
guesses or predicts the result ahead of time (which is what I think you imply
when you equate "prediction science" with pre-analysis plans in your
original PNAS paper). The quality of the
evidence depends on things like whether there was a pre-analysis plan and
whether construct validity and internal validity and external validity were
high. If all those things are in place, the study might provide very strong
evidence. Whether that strong evidence is sufficient to change a researcher's
mind about a theory [or prior belief/expectation] may depend on the
researcher's degree of confidence in the theory before the study was run. If
they are very confident in the theory, then even strong evidence may not be
enough to change their mind. But they can't call it weak evidence or a poorly
conducted study just because the results turned out to be different from what
they expected.
Charlie: …I agree with your
concerns about the threats to falsifiability that come from differential
interpretations of studies. In my earlier email, I was mainly trying to
simulate the reaction that someone might have to learning the results of a preregistered study that goes way against their priors/theories/expectations.... Someone
who believes in ESP and I might be able to agree on how p-values work (although
ESP does have some really interesting implications for the ability to construct
data-independent analysis plans) but we are likely to not agree on how time
works (where our theories disagree). Even if I agree that the p-value from
their study is diagnostic and that their study is high quality (high validity
and all that), I may still think it's more likely than not that their results represent a true false positive, and not reality, because it's so against my
theory and prior beliefs. Again, I'm not saying that I'm being rational or fair
in this situation, but it does represent the gap between believing in statistical results, judging the implications of a given result, and then revising theories/beliefs.
Alison: I agree that it's
tempting (and human nature) to think: This new information contradicts my
existing attitudes and beliefs (including my favorite theoretical predictions),
and so I don't think it's as good quality as I would if the same kind of
information supported my existing attitudes and beliefs. In fact, I have run
studies on exactly this kind of motivated reasoning (e.g., participants believe
a scientific study is better in quality when its conclusions support vs.
contradict something they want to believe). But this kind of irrational
reasoning is NOT good scientific reasoning, and it's arguably a big part of
what landed us in our current mess. Scientists need to seek consensus about
what constitutes good quality evidence independently from
whether that evidence happens to support or oppose their preferred conclusions.
Indeed, this is one of the major benefits of reviewed preregistrations (a.k.a.
registered reports) like those at Cortex and CRSP and JESP and other journals:
Reviewers evaluate the soundness of the study's methods before the
results are known.
In the ESP example, I think what you need to say is: "It would take a LOT of high quality evidence to change my belief that people are unable to predict the future," rather than "evidence is high quality to the extent that it confirms my belief about people's ability to predict the future."
Charlie: …I agree that that's not good scientific reasoning. I
wasn't trying to display good scientific reasoning, merely just trying to
highlight a situation where someone could have different views on 1) the
diagnosticity of the statistical evidence (namely the p-value), 2) their
interpretation of a study, and 3) their resulting shifts in beliefs/theories.
I do agree with your line: "It would take a LOT of high quality evidence to change my belief that people are unable to predict the future," rather than "evidence is high quality to the extent that it confirms my belief about people's ability to predict the future." Based on my priors, a true false positive (they will happen from time to time, even with preregistration) may seem more likely in a single instance (judging a single study) and thus be my interpretation of the study ("it wasn't a bad study, I think the results were a fluke"). Multiple observations of the effect would then make them being false positives less likely and would force me to confront my beliefs/theories….
[We ended our discussion of this point soon after, so that we could move on to debate our second point of disagreement, which I will turn to in the next post.]
--
*Unless you're planning one of a small handful of statistical tests in an NHST framework that do care about the direction of your prediction, like a one-tailed t-test. And of course, Bayesian statistics provide a formal way of integrating a researcher's prediction (or lack of prediction), their confidence in that prediction/prior belief, and their commitment to update their beliefs based on the strength of evidence observed in a given study. But we're not talking about one-tailed tests or Bayesian statistics here.
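Still, for anyone who wants that Bayesian point made concrete, here is a toy sketch of my own (the Bayes factor of 10 and the two priors are arbitrary assumptions for illustration). Two researchers analyze the same study and agree on the strength of the evidence, a Bayes factor of 10 favoring H1 over H0, but because they started with different prior beliefs about the hypothesis, they walk away with different posterior beliefs. The evidence is identical; only the starting beliefs, and therefore the updated beliefs, differ.

```python
# Toy Bayesian updating sketch (illustrative assumptions only): the same
# Bayes factor is combined with two different priors about H1.
def posterior_prob(prior_prob: float, bayes_factor: float) -> float:
    """Update P(H1) using a Bayes factor BF10 = P(data|H1) / P(data|H0)."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

bf10 = 10.0  # assumed strength of evidence from the study, same for everyone

for label, prior in [("skeptic of the theory", 0.05),
                     ("advocate of the theory", 0.50)]:
    print(f"{label}: prior P(H1) = {prior:.2f} -> "
          f"posterior P(H1) = {posterior_prob(prior, bf10):.2f}")
# skeptic:  0.05 -> about 0.34
# advocate: 0.50 -> about 0.91
```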
--
Response from Charlie:
First, I want to thank Alison for engaging
with us on this topic and for sharing a draft of this blog post before posting
it. We had a very interesting conversation about these and other issues, and
I’d encourage interested readers to read through the whole exchange. I’d also
like to state upfront that I’m speaking just for myself here, as I was in our
conversation. I don’t claim to know what Brian thinks (finishing my
dissertation would be much easier if I did).
The main point Alison raises is that some folks
(much to her surprise) think that findings are more likely to be true if they
are predicted ahead of time. I’ll admit that I had some meta-surprise in
response to this, as I have apparently missed the hypothetical argument that
starts this blog post (I guess I’ve spent my time shouting about other aspects
of preregistration [1]).
However, that might be more a reflection of different people using terms to
mean different things. I tried to explain my reasoning for this in the
following paragraph, taken from our exchange:
“Jumping to your last
email, Alison, I think I disagree with the statement that "A study is not
stronger if a researcher guesses or predicts the result ahead of time"
because pre-analysis plans, at least to me, specify a prediction and do provide
stronger evidence when provided ahead of time. The variables in our models represent the conceptual relationships we are interested in, and we can have more confidence in the inferences we make from those models if we've specified them in a data-independent way. This feels to me a little bit like our bank shot metaphor in
our response (referring to this: psyarxiv.com/a6k7h). The bank shot is the
model that the shooter plans to use to make the basket. The inference that I'm
trying to draw from this scenario is whether or not I think the shooter is good
at basketball (or at least shooting in basketball?). Whether or not they call
bank, I can certainly agree that they made that particular shot. If that's the
only shot I care about, I don't much care whether they called it or not (no
need to use inferences if you've sampled the entire population of interest). But
if I care about drawing further conclusions about the shooter's ability, I will
have greater confidence in them if I knew their planned model ahead of time. In
that sense, it's better evidence because I've got a more accurate
representation of what happened (or a more accurate representation of the
relation between prediction and result)”
When we use
inferential statistics, we’re trying to infer something about the broader
world, broader populations, or future events from what we observed in the data.
If a researcher surveyed the political views of 100 undergraduates and only
wanted to draw conclusions about those 100 students at that one time, there’d be
no need to calculate p-values – they’ve sampled their entire population of
interest. However, that’s not the kind of question we typically ask in research
(and certainly not how we write our discussion sections). P-values give us a way of thinking about how likely a result at least as extreme as the observed one would be under a particular null hypothesis, and they are one tool we use to calibrate our confidence in a finding. P-values
also lose their diagnosticity if they come from data-dependent analyses, which
isn’t a worry if you’ve called your shot (or model) ahead of time.
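[An aside to make that last sentence concrete: one familiar data-dependent practice is peeking, that is, re-running the test as data accumulate and stopping as soon as p < .05. The simulation below is my own hypothetical sketch, with arbitrary sample sizes and numbers of looks, but it shows how this practice pushes the false-positive rate well above 5% even when the null hypothesis is true, while a fixed, preregistered sample size keeps it near the nominal level.]

```python
# Toy simulation (illustrative assumptions only): optional stopping, a
# data-dependent analysis, inflates false positives under a true null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_min, n_max, step = 5000, 20, 100, 10  # arbitrary sketch parameters

fixed_hits = 0    # significant at the preregistered sample size (n = 100)
peeking_hits = 0  # significant at ANY interim look from n = 20 to n = 100
for _ in range(n_sims):
    data = rng.normal(size=n_max)  # the null is true: the mean really is 0
    fixed_hits += stats.ttest_1samp(data, 0).pvalue < 0.05
    for n in range(n_min, n_max + 1, step):
        if stats.ttest_1samp(data[:n], 0).pvalue < 0.05:
            peeking_hits += 1
            break

print(f"False-positive rate, fixed preregistered n: {fixed_hits / n_sims:.3f}")
print(f"False-positive rate, stop at first p < .05: {peeking_hits / n_sims:.3f}")
# Roughly 0.05 versus 0.15 or so with this many looks.
```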
So yes, I think a
finding is more likely to be true if it is predicted ahead of time. Our
predictions are manifested in our statistical models and the results from those
models inform our confidence in a finding. As long as we’re using p-values as a
way to calibrate that confidence, it’s important to know if we’ve called our
shots or not.
[1] Such as “should there be a hyphen in preregistration?” Answer: No.