I thought of a labyrinth of labyrinths, of one sinuous spreading labyrinth that would encompass the past and the future . . . I felt myself to be, for an unknown period of time, an abstract perceiver of the world.
— Borges (1941), quoted in Gelman & Loken (2013)
Over the course of a research project, researchers will make a series of decisions about the data and their model: should we use a five-point scale, or seven? Should we code this variable as dichotomous, or as continuous? Should we use only complete cases, or impute missing data? Should we use a Poisson regression, or a negative binomial?
One pleasing way to think of this is that your dataset (and then your analyses) is one realization from a multiverse of possibilities. Like physicists who study real multiverses, we should start from the principle that our dataset shouldn’t be “special” in any way: if certain parameters happen to be fixed at certain values, there should be some natural reason for them to be there.
If we, say, happen to estimate a coefficient of 2.3 when regressing support for a given policy on the media spend from an advocacy campaign, we should consider what we would have estimated for this value across all of the different datasets that would have arisen from different choices made when collecting, processing, and analyzing the data.
A simple way to approximate this thought experiment is to actually give the raw data to different, isolated teams, and see what they come up with. That’s exactly what the researchers did in this experiment:
Two published causal empirical results are replicated by seven replicators each.
The results are scary:
We find large differences in data preparation and analysis decisions, many of which would not likely be reported in a publication. No two replicators reported the same sample size. Statistical significance varied across replications, and for one of the studies the effect's sign varied as well. **The standard deviation of estimates across replications was 3-4 times the typical reported standard error.**
I added emphasis to the last sentence, because that’s the key result. The reported standard error is, well, reported -- it’s the model’s uncertainty as understood by both the analysts and decision-makers. The “multiverse uncertainty” -- the spread of estimates that could have been obtained by equally credible ways of analyzing the data -- is several times the uncertainty that’s reported.
So the uncertainty that you see is just the tip of the iceberg: it’s the uncertainty given a single model and a single dataset. The true uncertainty is what you get when you overlay all plausible models and datasets on top of each other.
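To make that ratio concrete, here’s a minimal sketch of the comparison. The numbers are made-up stand-ins for what seven replication teams might report, not the actual estimates from the study:

```python
import statistics

# Hypothetical coefficient estimates from seven independent replication teams
# (illustrative values only, not the study's actual numbers).
replication_estimates = [1.8, 2.3, 3.1, -0.4, 2.9, 1.1, 2.6]

# The standard error reported alongside a single published estimate (also hypothetical).
reported_se = 0.35

multiverse_sd = statistics.stdev(replication_estimates)
print(f"spread across replications: {multiverse_sd:.2f}")                # ~1.23
print(f"reported standard error:    {reported_se:.2f}")
print(f"ratio:                      {multiverse_sd / reported_se:.1f}x")  # ~3.5x
```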
Academics are familiar with spot-checking for this uncertainty by doing robustness checks: re-running the analysis under a few different specifications of the model and/or the data, and seeing whether the main result holds. These kinds of checks, however, are not up to the scale of the challenge. With ten independent binary choices, we’re already at about a thousand (2^10) combinations of data and analysis; with twenty, we’re past a million.
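To see how quickly the grid blows up, treat each decision as a binary choice and count the combinations. The choice names below are hypothetical stand-ins echoing the examples at the top of the post:

```python
from itertools import product

# Hypothetical binary analysis decisions; each entry lists the options.
choices = {
    "scale": ["five_point", "seven_point"],
    "variable_coding": ["dichotomous", "continuous"],
    "missing_data": ["complete_cases", "impute"],
    "model": ["poisson", "negative_binomial"],
}

pipelines = list(product(*choices.values()))
print(len(pipelines))      # 2^4 = 16 distinct pipelines from just four decisions
print(2 ** 10, 2 ** 20)    # ~a thousand with ten decisions, ~a million with twenty
```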
So what are we left to do? The inspiration for this post’s title essentially suggests brute-forcing the problem: enumerate the choices made when processing the data and performing the analysis, and run versions of the analysis under the alternative choices. This seems brittle (not to mention exhausting), but at the moment it does seem like all we have, at least when performing analyses.
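Here’s a minimal sketch of what that brute-force enumeration might look like, using statsmodels and a synthetic dataset. The variable names, the tiny choice grid, and the data are all hypothetical; a real multiverse analysis would have far more branches:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from itertools import product

# Synthetic stand-in data: a count outcome ("support") and one predictor ("media_spend").
rng = np.random.default_rng(0)
raw_data = pd.DataFrame({"media_spend": rng.uniform(0, 10, 500)})
raw_data["support"] = rng.poisson(np.exp(0.1 + 0.2 * raw_data["media_spend"]))
raw_data.loc[rng.choice(500, 50, replace=False), "media_spend"] = np.nan  # inject missingness

def prepare(df, missing_data):
    """Apply one data-preparation choice."""
    if missing_data == "complete_cases":
        return df.dropna()
    return df.fillna(df.mean(numeric_only=True))  # crude mean imputation

def estimate(df, model):
    """Fit one model specification and return the coefficient on media_spend."""
    if model == "poisson":
        fit = smf.poisson("support ~ media_spend", data=df).fit(disp=0)
    else:
        fit = smf.negativebinomial("support ~ media_spend", data=df).fit(disp=0)
    return fit.params["media_spend"]

# Enumerate every combination of choices and keep the estimate from each branch.
estimates = [
    estimate(prepare(raw_data, missing), model)
    for missing, model in product(["complete_cases", "impute"],
                                  ["poisson", "negative_binomial"])
]
print(f"{len(estimates)} specifications, spread = {np.std(estimates):.3f}")
```

Each entry in `estimates` corresponds to one path through this (tiny) garden of forking paths; the spread across them is the multiverse uncertainty discussed above.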
When consuming analyses, we have the ability, as always, to take things with a grain of salt, and to reserve judgment until analyses are repeated and confirmed. At the very minimum we should require that analyses are performed (and documented) in multiple ways, representing at least a sample of the data multiverse.
This is super fascinating, and I feel like more people should be thinking and talking about it.