Can you estimate probabilities without data?
Reasoning about the lab leak hypothesis (or whatever else you want to know)
The “lab leak hypothesis” (which, if you’ve been living under a rock recently, is the possibility that Covid leaked out of a lab in Wuhan, rather than having passed to humans from a bat or other animal) is a great substrate for a conversation about how to reason in the absence of hard data.
This post is going to violate Betteridge’s law of headlines in that I will argue “yes” to the question. You can indeed put hard numbers behind soft data.
How does this work? To answer, we need to talk (a bit) about what it means to be “Bayesian”. Bayesian is a word that gets thrown around a bunch, and like many buzzwords the term is overloaded with meanings. Today, though, we will focus on just one sense of the word, which covers how to update your beliefs when you get new evidence. For Thursday’s post, I’ll cover a simple way to think about the probabilities of things that either will happen or won’t, as opposed to things that can sensibly happen, say, 23% of the time.
Imagine that you’re not feeling well and want to get tested for Covid. This leaves us with four possibilities:
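Laid out as a grid (test result across the columns, actual Covid status down the rows):

                     test negative      test positive
don’t have covid     true negative      false positive
have covid           false negative     true positive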
You test positive. This means that we’re now in the second column. The question now is, how likely are you to actually have covid? As you can see, there are two possibilities -- you could have gotten a “false positive” (testing positive without having it) or a “true positive” (testing positive while having it). The ratio of true positives to all positives (both true and false) gives us the probability that you now have covid.
We can represent these probabilities as areas. The universe throws a dart at this square, and where it lands is what happens; larger areas are higher probabilities. It might look like this:
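To have concrete numbers to work with, here is one made-up way the square could be carved up (illustrative only, not real test statistics):

have covid and test positive (true positive): 9%
have covid and test negative (false negative): 1%
don’t have covid and test positive (false positive): 4.5%
don’t have covid and test negative (true negative): 85.5%

With these areas, p(covid) = 10%, p(positive if covid) = 9% / 10% = 90%, p(positive) = 9% + 4.5% = 13.5%, and the ratio from before, true positives over all positives, is 9 / 13.5 ≈ 67%.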
Now we can think of this “Covid and test positive” area in either of two ways:
First asking “what is the probability that I tested positive?” and then “what is the probability that I have covid if I test positive?” or
First asking “what is the probability that I have covid?” and then “what is the probability that I test positive if I have covid?”
In an equation, that looks like this:
p(covid if positive) * p(positive) =
p(positive if covid) * p(covid)
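Sanity-checking with the made-up areas from above: p(covid if positive) * p(positive) ≈ 67% * 13.5% ≈ 9%, and p(positive if covid) * p(covid) = 90% * 10% = 9%. Both routes land on the same thing: the area of the “covid and test positive” region.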
Let’s rearrange this by dividing both sides by p(positive):
p(covid if positive) =
p(positive if covid) *
p(covid) /
p(positive)
This one-line rearrangement of terms is Bayes’ theorem, the one you’ve heard so much about. It’s literally just this one line. Not so crazy, huh?
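Running the same made-up numbers through it: p(covid if positive) = 90% * 10% / 13.5% ≈ 67%, which matches what we got by comparing areas directly.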
Anyway, this tells you how to update your internal estimate of how likely you are to have covid after you receive a positive test. There are a few important terms here that require some explanation. The p(covid) term is just the background rate of having covid. It’s important because if having covid were very rare, then the number of people getting a false positive would be larger than the number of people getting a true positive. This can be true even if the test is very accurate! This situation would look something like this:
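To make that concrete with made-up numbers (illustrative only): suppose just 1 in 1,000 people actually has covid, the test catches 90% of true cases, and it comes back falsely positive 5% of the time for people who don’t have it. Out of 10,000 people tested, you’d expect about 9 true positives but roughly 500 false positives, so a positive result would only mean about a 2% chance of actually having covid, even though the test itself sounds quite accurate.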
That background rate is called the prior probability. Prior because it’s before you see any evidence. The p(positive if covid) term is called the likelihood, but I like to think of it together with the denominator as the single ratio p(positive if covid) / p(positive), because that ratio tells you how strong your evidence is. If the tests were so bad that you always got a positive result back no matter what, then this ratio would be 1 and the positive test would be no evidence at all: you’d simply get your prior back. If you almost never got a positive test unless you actually had covid, then this ratio would be very large, and you would increase your prior by quite a bit.
The term on the left hand side is called the posterior, because it’s after you see some evidence (in this case, a positive test). Basically, you take your prior probability (what % of people in the US have covid today?), update it by the strength of your evidence (how good is this test?) and you get your after-evidence probability, the posterior.
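If code is easier to read than formulas, here is the same update written out as a few lines of Python, using the made-up test numbers from earlier (again, purely illustrative):

```python
def update(prior, p_positive_if_covid, p_positive):
    """Bayes' theorem: posterior = likelihood * prior / evidence."""
    return p_positive_if_covid * prior / p_positive

# A useless test that comes back positive for everyone: the evidence ratio is 1,
# so the posterior is just the prior again.
print(update(prior=0.10, p_positive_if_covid=1.0, p_positive=1.0))     # 0.10

# The made-up test from earlier: p(positive if covid) = 90%, p(positive) = 13.5%.
print(update(prior=0.10, p_positive_if_covid=0.90, p_positive=0.135))  # ~0.67
```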
Okay, so what does this have to do with covid’s origins? Well, there has been some “evidence” that covid may have leaked from the lab. To keep things simple, I’ll just use one element, which is that a few folks from the lab got sick with flu-like symptoms in November 2019, right before covid started to spread. So we can use our equation:
p(covid leaked | people from lab got sick) =
p(people from lab sick | covid leaked) *
p(covid leaked) /
p(people from lab sick)
The first thing to interject is a nitpick about terminology:
When Nate Silver said that his priors had changed, what he really meant was that he had obtained a posterior from evidence. But, to be fair to Nate, the virtue of Bayesian reasoning is that your posteriors become your new priors in the face of new evidence. So Nate wasn’t wrong, either.
So what is the probability that covid leaked from a lab? This all depends on the values that you assign to the terms above. Based on my very rough sense of the debate, most experts seem to place a vanishingly small value on the prior p(leaked from lab) because the evidence for an alternate hypothesis is very compelling. But lab leaks aren’t entirely unheard-of! Here is a thread of leaks that have happened in the past.
What about this ratio p(people from lab got sick | leaked from lab) / p(people from lab got sick)? Well, I think the probability that people from the lab got sick, if it turns out that covid did leak from the lab, is a very high number, because infecting the people who work there is almost certainly how an accidental leak would happen in the first place.
On the other hand, people get sick all the time, especially in flu season (November). But these folks were sick enough to go to the hospital, so really these terms should maybe be adjusted to:
p(sick enough to go to hospital | leaked from lab) /
p(sick enough to go to hospital)
So I think that this is actually fairly strong evidence. People were much more likely to end up in the hospital with covid than with the regular flu, and if it turns out that covid did leak from the lab, people from the lab getting sick would be, like, the main way for that to happen. I’m just throwing a random number out there, but let’s say that people working at the lab getting sick enough to go to the hospital is 20x more likely if the virus did leak from the lab than if it did not.
Well, if your prior was a 1% chance, this evidence takes you to a 20% chance. If your prior was a 0.01% chance, it takes you to a 0.2% chance.
Where do I stand? I haven’t done enough research to have a good sense of what my priors should be. A 0.5% prior seems fair to me in that it is small but nonzero. That would put me at a roughly 10% posterior that covid leaked from the lab.
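Here is that arithmetic written out, following the same back-of-the-envelope multiplication (the 20x strength-of-evidence number is, again, just my made-up guess):

```python
# My rough guess at p(sick enough for hospital | leak) / p(sick enough for hospital).
evidence_strength = 20

for prior in (0.01, 0.0001, 0.005):
    posterior = prior * evidence_strength
    print(f"prior {prior:.2%} -> posterior {posterior:.1%}")

# prior 1.00% -> posterior 20.0%
# prior 0.01% -> posterior 0.2%
# prior 0.50% -> posterior 10.0%
```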
For Thursday this week? We’ll discuss what it even means to think that there’s a 10% chance that something happened, when it either happened or didn’t. In short, we’ll discuss this guy’s critique of the whole thing: