An extremely common question we get, both with Gradient and with Recast, is: “did you control for ___?”, where ___ is any particular variable of interest. Think things like COVID, the respondent’s political orientation, etc.
I wish we could always just say “yes, we were diligent and thoughtful and controlled for that variable, too”, but unfortunately deciding what to control for in a statistical model is much more complicated than “the more the better”.
Let’s take a motivating example about height and basketball performance.
Let’s say you’re a researcher and you’re interested in how being tall makes you better at basketball. You look around for a source of data and realize that the NBA has everything you need: player heights are publicly listed, and all sorts of statistics are available to represent player performance.
You plot points per game against height and you find a surprising fact: among NBA players, there is essentially no relationship between the two.
Huh. This is counter to all intuition! Height should definitely be correlated with basketball performance — we all know it is — so why isn’t it??
Let’s say you have the following model of the world: basketball ability is a combination of height, talent, and work ethic. Getting into the NBA is purely a function of your basketball ability. This would look something like this:
In statistical form, we can run a quick simulation of this model by doing the following:
N <- 1e3
height <- rnorm(N)      # standardized height
talent <- rnorm(N)      # innate talent
workethic <- rnorm(N)   # work ethic
# Basketball ability is a combination of all three, plus noise:
basketball_ability <- height + talent + workethic + rnorm(N)
# A latent "gets into the NBA" score: ability plus noise centered at -3,
# so only players with very high ability end up with high scores:
nba <- basketball_ability + rnorm(N, mean = -3)
Let’s plot the relationship between height and basketball ability:
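A minimal sketch of that plot in base R, continuing from the simulation above (the axis labels are ours):

plot(height, basketball_ability,
     xlab = "Height (standardized)",
     ylab = "Basketball ability")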
Okay great! We got the relationship we wanted! Being taller makes you better at basketball.
Now let’s zoom in on the top of the graph, and look at those that got into the NBA:
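To recreate this view we need a cutoff for who made the NBA; the simulation leaves nba as a continuous score, so the threshold of zero below is an assumption, not something pinned down by the model:

in_nba <- nba > 0  # assumed cutoff: high latent scores make the NBA
plot(height[in_nba], basketball_ability[in_nba],
     xlab = "Height (standardized)",
     ylab = "Basketball ability")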
Huh, maybe there’s some relationship here, but it’s not at all clear that there is, or how strong it would be. Why is that?
The reason is that by subsetting to only those players that got into the NBA, we’ve “controlled” for that variable. And since getting into the NBA is a function of basketball ability, we’ve controlled for the very thing that we care about!
Let’s take another example, one of a million you can find if you search the internet for “collider bias”. Let’s say you, like many people, survey your dating history and find that the people you’ve dated were either very nice and not that attractive, or very attractive and a bit mean. You conclude, based on your experience, that people “out there” are either nice and plain, or attractive and mean. Is this the right conclusion?
It is not, because you’re “conditioning” on your own experience. Just as we subsetted the graph above to the simulated players that “made the NBA”, you are subsetting the space of people to those you’ve dated. And you probably require that meaner people compensate in other ways, such as being more attractive.
Graphically, this looks something like the following:
From your perspective, it sure looks like more attractive people are less nice, and vice versa. But that’s simply not true among the population at large.
This is called “conditioning on a collider” because you are “conditioning” on (controlling for) a variable where causes “collide”. Graphically, you can see the two arrows from “Niceness” and “Attractiveness” colliding at “Dateability”.
Colliders are common effects: variables that are caused by two or more other variables. When you control for a common effect, you create a correlation where none existed. If you date someone and they’re mean, there must be some other reason you’re dating them! That “there must be some other reason” creates a correlation between niceness and attractiveness where there was none before.
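To make this concrete, here is a minimal sketch of how a sims data frame like the one in the regression below might be generated. The threshold and noise term are our assumptions; only the variable names and sample size come from the output:

set.seed(1)
N <- 1e4
Niceness <- rnorm(N)        # independent of attractiveness...
Attractiveness <- rnorm(N)  # ...in the population at large
# You only date people who are nice enough, attractive enough, or both:
Dateable <- (Niceness + Attractiveness + rnorm(N)) > 2
sims <- data.frame(Niceness, Attractiveness, Dateable)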
So if you actually included a variable called “Dateable” in your model, you’d get a very confident negative coefficient on Attractiveness in a regression predicting Niceness:
Call:
lm(formula = Niceness ~ Attractiveness + Dateable, data = sims)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5336 -0.6512  0.0051  0.6693  3.3867

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.033695   0.009890  -3.407 0.000659 ***
Attractiveness -0.042333   0.009713  -4.358 1.32e-05 ***
DateableTRUE    1.889178   0.077134  24.492  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9767 on 9997 degrees of freedom
Multiple R-squared:  0.05681,	Adjusted R-squared:  0.05663
F-statistic: 301.1 on 2 and 9997 DF,  p-value: < 2.2e-16
This, of course, is all wrong: as we can see from the cloud of points above, there is no correlation in the population.
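Continuing the simulation sketch above, dropping the collider recovers the truth:

summary(lm(Niceness ~ Attractiveness, data = sims))
# With the collider removed, the coefficient on Attractiveness
# is statistically indistinguishable from zero, as it should be:
# Niceness and Attractiveness were simulated independently.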
These are toy examples, but they illustrate a pernicious and common problem when building statistical models: figuring out what to control for. Commonly this problem is discussed when the goal is estimating causal effects, but even in the case we just laid out, where we are simply trying to estimate a correlation, one can see how quickly and easily we are led astray by including the wrong things in the model.
It is commonplace among junior data scientists and amateur statisticians to control for everything they can get their hands on, or to use variable selection algorithms like LASSO to choose the variables for them.
Richard McElreath (perhaps via Judea Pearl) calls this “causal salad”: a bunch of variables thrown together without much of an understanding why. This is a recipe for disaster: including or excluding the wrong variables can take away correlations where they existed, and create ones where they did not.
This is one of the many reasons it’s important that the modeler has a good qualitative understanding of the problem (which variables are relevant, and how they are related), before they start building regressions. This is why data science needs qualitative research.
Once you’ve written down all your variables and drawn arrows from causes to effects, it’s trivial to figure out what to control for. A computer can do that part for you, but it cannot draw the graph in the first place. If you’re looking to understand this a bit better, this is a good primer on good and bad controls in regression.
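As a concrete example of the computer doing that part, here is a sketch using the dagitty R package; the DAG below is just the basketball model from earlier:

library(dagitty)  # install.packages("dagitty") if needed

g <- dagitty("dag {
  Height -> Ability
  Talent -> Ability
  WorkEthic -> Ability
  Ability -> NBA
}")

# What should we control for to estimate the effect of Height
# on Ability? The answer is the empty set: control for nothing,
# and in particular do NOT condition on NBA, which sits
# downstream of the outcome.
adjustmentSets(g, exposure = "Height", outcome = "Ability")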