What even is a statistical model?

Jun 22, 2021

To answer that question, we first have to ask (and answer) another one: “what is a model?”

My definition of a model is a simulation of a subject that can be manipulated. Computer models of buildings, toy models of trains, financial models of businesses: what all these have in common is that they simulate something about the real world. Is the building structurally sound? What happens if I take a turn too fast? What will the cash flow statement of this company look like in a year?

Notice that I said “simulation of a subject”. This may have sounded weird (as it does to my ears), but the inclusion is important in a certain sense. Models of buildings truly are models of buildings, not of soil, not of weather, and so on, even though some of these may be related to whether or not buildings stay standing (and so may be represented in the model). The important thing for models of buildings is that they model the dynamics of the building correctly, and that is what makes buildings the subjects of such models. Models choose a subset of the world to simulate well, incorporating elements required to make a convincing simulation of the subject, and mostly ignore the rest.

The other key component of a model is that it must have inputs. A cash flow model of a business may take assumptions about the growth rate of a product line, or the success of an expansion plan into a new market. A computer model of a building may take the weight of the roof, location of supports, &c. Relatedly, these inputs must have some relationship to the outputs: “how much more valuable is this company if the expansion plan works well?”

I contend that statistical models are models of data. This may sound obvious but let me explain why it’s not. You may intuitively think, “well statistical models use data, so of course they’re models of data”. But that’s not quite right, because models of buildings and financial models of companies also use data, but data is not their subject. Models of buildings output simulated buildings, not simulated datasets. The subject of a model is what it simulates, not what it uses.

What I mean when I say that a statistical model is a model of data is that if you write down a statistical model, it will generate a simulated dataset for you, in much the same way that if you program in AutoCAD you will generate a simulated building, or in Excel a simulated business. Statistical models simulate datasets the same way that other models simulate their subjects. Now, it is not standard practice (yet) to use statistical models in this way, but that is what they’ll do if you ask them to, and I believe that it is their most natural, most primitive function. Everything else: estimating parameters, making predictions, and so on, is second-order to the core business of simulation.
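To make the "write down a model, get a dataset" idea concrete, here is a minimal sketch in Python. The model itself is my own illustrative assumption, not something from a particular analysis: a straight line plus Gaussian noise, with made-up parameter values and a hypothetical function name `simulate_dataset`.

```python
import random


def simulate_dataset(intercept, slope, noise_sd, n, seed=None):
    """Simulate n (x, y) pairs from y = intercept + slope * x + Normal(0, noise_sd)."""
    rng = random.Random(seed)  # local generator so results are reproducible
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)        # assumed range for the predictor
        noise = rng.gauss(0.0, noise_sd)  # assumed Gaussian noise
        y = intercept + slope * x + noise
        data.append((x, y))
    return data


# Parameters go in, a simulated dataset comes out.
dataset = simulate_dataset(intercept=1.0, slope=2.0, noise_sd=0.5, n=5, seed=42)
for x, y in dataset:
    print(f"x={x:.2f}  y={y:.2f}")
```

Change the slope or the noise level and you get a different simulated dataset, which is the same kind of manipulation you would do to a building in AutoCAD or a business in Excel.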

The main difference, though, is that unlike the building designer who is using their programs to create a new building, the statistician is starting at the other end. They are not trying to design a new, beautiful dataset. They already have one and they have to figure out how it was made: what program, with what parameters, generated it? And this is why I believe that most people intuitively think of statistical models as using data to estimate parameters, as opposed to orienting their thinking the other way around, to thinking of statistical models as using parameters to simulate data.

Now, if you’re a Bayesian, you may have already reoriented your thinking this way. The way that Bayesians estimate parameters is to generate lots and lots of simulated datasets, pick out those that are most similar to their actual dataset, and average the parameters and programs that generated those. Okay, that’s not actually how MCMC/NUTS/Variational Bayes work, because in high-dimensional spaces the “lots and lots” that you’d need is, well, lots. But you can think of it that way.
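The simulate, compare, and average picture can be run literally on a toy problem. The sketch below is a rejection-style approximate Bayesian computation, which is a different technique from MCMC, NUTS, or variational Bayes. Everything specific in it is an assumption for illustration: the observed numbers are made up, the model is a Normal with unknown mean and standard deviation 1, the prior on the mean is Uniform(-5, 5), and "similar" means the simulated sample mean lands within a tolerance of the observed one.

```python
import random
import statistics


def simulate(mu, n, rng):
    """Generate one simulated dataset of size n from Normal(mu, 1)."""
    return [rng.gauss(mu, 1.0) for _ in range(n)]


def abc_rejection(observed, n_draws, tolerance, seed=None):
    """Return the parameter draws whose simulated data resemble the observed data."""
    rng = random.Random(seed)
    obs_mean = statistics.fmean(observed)
    accepted = []
    for _ in range(n_draws):
        mu = rng.uniform(-5.0, 5.0)              # draw a candidate parameter (assumed prior)
        fake = simulate(mu, len(observed), rng)  # simulate a dataset from it
        # Keep the candidate only if its dataset looks like the real one.
        if abs(statistics.fmean(fake) - obs_mean) < tolerance:
            accepted.append(mu)
    return accepted


observed = [2.1, 1.7, 2.9, 2.4, 1.5, 2.2, 2.8, 1.9]  # made-up data for illustration
accepted = abc_rejection(observed, n_draws=20000, tolerance=0.1, seed=0)
estimate = statistics.fmean(accepted)  # average the parameters that survived
print(f"kept {len(accepted)} of 20000 draws, estimated mean = {estimate:.2f}")
```

With one parameter this is cheap. The acceptance rate collapses as the number of parameters grows, which is the “lots and lots” problem and the reason real samplers work differently.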

Okay, so now that I’ve written that long introduction, I realize I never described what a statistical model actually is. At its most basic level, it’s a list of formulas that describe how variables are related and generated. That’s easy enough to state, but it doesn’t really explain what’s going on. I am running out of time to get this out the door, and I wanted to do a write-up of the ongoing ranked-choice NYC mayoral election, so that’ll come (hopefully) on Thursday.

Applied Inference
