A quick note: I found out over the weekend that Applied Inference had been featured by Substack. I want to thank all the new folks that found this little newsletter via that page. If you have any questions, feedback, or just want to start a conversation, please leave a comment below or email me at thomas.vladeck@gmail.com.
One of the most common questions we get asked at Gradient is how large a sample we need to “be significant”. I understand why: the people we work with want to make sure that they can be confident in the results we obtain, and “significance” is the framework they’ve been taught (or absorbed) for understanding whether or not something is “real”.
It’s a messy situation; the terms “significance” (and “confidence”) are overloaded, and stakeholders aren’t necessarily clear in their own minds what exactly they’re asking for. They just want to be confident in the results, and “significance” (or “confidence”) is the keyword they’ve been trained to ask about.
Regardless of what they’re asking, here’s the question you should be answering: “what is our power?”
“Power” is a statistical term of art, just like significance. In the paragraphs that follow, I’ll explain how they’re different, why power is the right number to worry over before you collect the data and build your model, and why significance is what you want to ask about afterwards.
“Significance”, in the classical statistical framework, roughly translates to “we’re pretty sure this isn’t a false positive”. Your data and model tell you that there is a relationship between your two variables, but significance asks whether that result would have been weird if there were actually no relationship between them (i.e., a false positive). How weird? Would it have to be in the top 5% of extreme results? The top 1%? This last value is the “confidence” level, which is the threshold for weirdness one uses when assessing significance.
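To make the “how weird?” question concrete, here’s a minimal simulation in Python. Every number in it (the observed estimate, the sample size, the noise level) is made up purely for illustration; the point is the logic of asking how often chance alone would produce a result this extreme:

```python
import numpy as np

# Suppose we observed a mean effect of 0.8 from n = 100 observations,
# and the null hypothesis says the true effect is 0 (no relationship).
# All of these values are hypothetical, chosen just for illustration.
rng = np.random.default_rng(0)
observed = 0.8
n, sigma = 100, 3.0  # assumed sample size and noise level

# Simulate a world where the null is true, many times over, and ask
# how often chance alone produces a result at least this extreme.
null_means = rng.normal(0, sigma, size=(100_000, n)).mean(axis=1)
p_value = np.mean(np.abs(null_means) >= abs(observed))

print(f"p-value: {p_value:.3f}")  # below 0.05 means "significant" at 95% confidence
```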
“Power” is the other side of the coin. Instead of being concerned with a false positive, it is concerned with the possibility of a false negative. A false negative would be a situation where a relationship between two variables actually exists, but you can’t show it at an acceptable level of significance with the data that you have.
A typical power calculation goes like this: “assume Y increases by 5 when we increase X by 1, on average. If we ran this experiment a million times, what percentage of the time would we get a significantly positive result?” The answer to this question is your power.
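That “run it a million times” framing maps directly onto a Monte Carlo simulation. Here’s a sketch; the sample size and noise level are assumptions I picked for illustration, and with these particular values the experiment turns out to be underpowered -- roughly a coin flip:

```python
import numpy as np
from scipy import stats

# The power question from above: assume Y really does increase by 5
# when X increases by 1, then count how often a simulated experiment
# yields a significant slope. Sample size and noise are assumptions.
rng = np.random.default_rng(42)
true_slope, n, noise_sd = 5.0, 30, 4.0
n_sims, alpha = 10_000, 0.05

significant = 0
for _ in range(n_sims):
    x = rng.uniform(0, 1, n)
    y = true_slope * x + rng.normal(0, noise_sd, n)
    if stats.linregress(x, y).pvalue < alpha:
        significant += 1

# With these made-up values, power comes out near 50% -- a coin flip.
print(f"power: {significant / n_sims:.2f}")
```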
Why is this important? Let’s go through a quick example:
We all know about election polls, so let’s use them: the ones that show the race as a +/- figure, e.g. “Biden is up +5, with 52.5% of the two-party vote share, relative to Trump’s 47.5%”. These polls typically have about 1,000 respondents, because the margin of error of a percentage with 1,000 random responses is about 3%. (These polling MOE estimates ignore several other sources of error, and as a rule of thumb should be doubled, but that’s another conversation.)
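The 3% figure falls out of the standard margin-of-error formula for a proportion. A quick check, assuming simple random sampling and 95% confidence:

```python
import math

# Margin of error for a proportion under simple random sampling at
# 95% confidence (z = 1.96); p = 0.5 is the worst (widest) case.
n, p, z = 1000, 0.5, 1.96
moe = z * math.sqrt(p * (1 - p) / n)
print(f"MOE: {moe:.1%}")  # about 3.1%
```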
Well, here’s the trouble. Let’s assume that the actual difference (regardless of what we see in the polling data) is 5%, with Biden leading. If we ran a gazillion 1,000-person polls, how often would we be confident that Biden is even leading? (Let alone by 5%.) Just about 36% of the time. This means that
if there were actually a 5% Biden lead, and you collected a gold-standard 1,000-person poll, you’d walk away with a less-than-significant result 64% of the time.
If the +/- difference were smaller -- just 1% -- you’d get a less-than-significant result 95% of the time. In order to have an 80% chance of detecting a 1% difference, you’d need 75,000 responses. (This is not a typo; it is literally 75x the original sample.)
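If you want to check these numbers, here’s a sketch using the standard normal-approximation power formula for a two-sided test of the two-party share against 50%, at 95% confidence. The exact figures wobble a point or two depending on which approximation you use, but they land close to the ones above:

```python
import math
from scipy.stats import norm

def poll_power(lead, n, alpha=0.05):
    """Approximate power to detect a two-party `lead` (e.g. 0.05 for +5)
    with an n-person poll: a two-sided test of share = 50% at `alpha`,
    using the normal approximation throughout."""
    p = 0.5 + lead / 2                # a +5 lead means a 52.5% share
    se = math.sqrt(0.25 / n)          # standard error under the null
    z_crit = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    shift = (p - 0.5) / se
    return norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)

print(f"+5 lead, n = 1,000:  {poll_power(0.05, 1_000):.0%}")   # about 35%
print(f"+1 lead, n = 1,000:  {poll_power(0.01, 1_000):.0%}")   # about 6%
print(f"+1 lead, n = 75,000: {poll_power(0.01, 75_000):.0%}")  # about 78%
```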
This shows that typical national surveys are hopelessly underpowered for the leads candidates usually have in presidential elections. If you’re trying to tell who is ahead from just one poll, you’re going to have a bad time. This is why aggregators like FiveThirtyEight, The Upshot, RealClearPolitics, and The Economist are so important.
Sharp readers will have noticed that “significance” plays a role in the power calculation and in my example above. In fact, a key determinant of power is your criterion for significance -- do you need to be at a 90% confidence level to be significant? 95%? Each threshold will have a different power. (In the example above, I used 95%.)
The process of calculating your power forces you to think explicitly about the strength of the relationship you’re looking for, often called the “effect size” -- and every other part of your study design. There is no one power; there is a power for every combination of sample size, effect size, signal-to-noise ratio, sampling methodology, &c. Above, when I moved from a +/- of 5% to 1%, the power dropped from 36% to 5%. Anything that affects your parameter of interest (either what it’s likely to be, or how you estimate it) will affect the power of your experiment.
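To see this concretely, here’s a quick sweep over lead sizes and sample sizes using the same normal approximation as before. Every cell of the grid is a different power:

```python
import math
from scipy.stats import norm

# One power per design: sweep the assumed lead and the sample size.
def poll_power(lead, n, alpha=0.05):
    se = math.sqrt(0.25 / n)
    z_crit = norm.ppf(1 - alpha / 2)
    shift = (lead / 2) / se
    return norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)

for lead in (0.01, 0.03, 0.05):
    cells = " | ".join(f"n={n}: {poll_power(lead, n):.0%}" for n in (500, 1_000, 5_000))
    print(f"+{lead:.0%} lead -> {cells}")
```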
And this is why it’s so important to think about power before you collect the data: because all these things haven’t happened yet! You haven’t collected a particular sample, you haven’t run the model, &c., so these can still be changed. You may even realize that you’re hopelessly underpowered, and it doesn’t make sense to spend the money collecting data to try to answer the question this way. If you thought there were only a 1% difference between the candidates and you didn’t have the budget to collect more than 1,000 responses, you’d be better off putting that money into other things.
Actually doing power analyses can be pretty tough, and I’m putting a heavy gloss over the details here; if there’s interest I may dig into the gory details in a future post.
In contrast to power, which you can calculate ahead of time, whether or not the relationship you estimate is actually significant is something that can only be answered after you have collected the sample, run the model, &c. The best that you can know ahead of time is the chance that you will get a significant result (if it is warranted), and this is your power.
Unfortunately, in most cases, we’re not working on a presidential election, so we don’t have big poll aggregators backing us up with syntheses of other datasets. We have just one dataset, and just one model (or maybe a few) to answer the question in front of us. Fortunately, for most applications in market research, we don’t care about 5% differences; we care about the big differences that will make or break a product, advertising campaign, or pricing strategy.
The last thing in the world you want is to do all that data collection and modeling in vain, because your data simply cannot answer your question. Or, less commonly, the flip side, which is that you collected much more data than you needed to answer your question, and you could have done the project much more cheaply. Both situations can be avoided by doing power analyses ahead of time.