Why data science needs qualitative research
We care about causal effects, but the data alone never supplies them
In my latest post, I said that “umbrellas don’t cause rain, even though you only see them around when it’s raining.” That example may have been too obvious for the point to land. To a person who knows that umbrellas are taken out by people to avoid getting wet when it’s raining, it’s clear that umbrellas don’t cause rain. But to a computer that doesn’t already know this causal mechanism, it would actually be impossible, from the data alone, to distinguish a world where rain causes umbrellas from one where umbrellas cause rain.
This is a phenomenon called “Markov equivalence”: different causal graphs can imply exactly the same patterns of association, so, outside of rare exceptions, you can switch around the arrows of causality and your data will “look” exactly the same.
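To see this concretely, here is a small simulation (a sketch, with invented probabilities): world A generates data where rain causes umbrellas; world B reverses the arrow, with its parameters read off world A’s joint distribution. From the data, the two worlds are indistinguishable:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# World A: rain causes umbrellas.
rain = rng.random(n) < 0.3
umbrella = np.where(rain, rng.random(n) < 0.9, rng.random(n) < 0.05)

# The observed joint distribution P(rain, umbrella).
p_joint = np.array([
    [np.mean(~rain & ~umbrella), np.mean(~rain & umbrella)],
    [np.mean(rain & ~umbrella), np.mean(rain & umbrella)],
])

# World B: umbrellas "cause" rain. Its parameters are chosen to
# reproduce world A's joint distribution: P(umbrella) and
# P(rain | umbrella) are read straight off p_joint.
p_u = p_joint[:, 1].sum()
p_r_given_u = p_joint[1, 1] / p_u
p_r_given_not_u = p_joint[1, 0] / (1 - p_u)

umbrella_b = rng.random(n) < p_u
rain_b = np.where(umbrella_b,
                  rng.random(n) < p_r_given_u,
                  rng.random(n) < p_r_given_not_u)

# Both worlds yield the same joint distribution (up to sampling noise),
# so no algorithm can recover the arrow's direction from the data alone.
```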
What this means is that your knowledge of causality needs to come from outside the data. It is never in the data itself. Even in the seemingly trivial case of a controlled experiment, it is the researcher’s knowledge that treatment conditions were assigned at random that allows them to estimate the effect that the treatment had on the outcome.
As Nancy Cartwright puts it, “no causes in, no causes out”:
“one cannot get knowledge of causes from equations and associations alone”
Indeed.
Let’s take a more complex example. Let’s say you’re doing research on your customer base, and you assign a lifetime value to each customer. You’re doing some cuts of the data and you notice a curious fact: almost everyone who has a high lifetime value has bought the most expensive widget you sell:
Excitedly, you show this to your team and the company starts to advertise the Diamond Widget everywhere. It gets pride of place on the website, it gets discounts attached to it in email offers, and you buy advertising promoting the Diamond Widget everywhere under the sun.
Will this campaign work? Will the customers who now buy the Diamond Widget become high value customers? It depends on which of the following statements is true:
1. Buying Diamond Widgets causes customers to become high-lifetime-value customers, or
2. Whatever else causes someone to become a high-lifetime-value customer also causes them to buy Diamond Widgets.
In graphical form, the two options above look like the following:
If “other things” isn’t represented in your database, there is no algorithm known to humankind that can distinguish between these two possibilities.
Behind door one, the campaign is a success! The customers who buy the Diamond Widgets become loyal, high value customers.
Behind door two, the campaign is a failure! While some customers do buy some Diamond Widgets, they churn out at the same rates they did before. Back to the drawing board.
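The two doors can be sketched as data-generating processes. Everything here is invented for illustration (the purchase rates, the 400-point LTV gap, the `loyalty` variable): observationally the two worlds look identical, but the campaign, an intervention that makes everyone buy the widget, only works behind door one:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def door_one(buy_widget=None):
    # Door one: buying the Diamond Widget itself raises lifetime value.
    if buy_widget is None:
        buy_widget = rng.random(n) < 0.2
    ltv = 100 + 400 * buy_widget + rng.normal(0, 50, n)
    return buy_widget, ltv

def door_two(buy_widget=None):
    # Door two: an unobserved "loyalty" trait drives both the purchase
    # and the lifetime value; the widget itself does nothing.
    loyalty = rng.random(n) < 0.2
    if buy_widget is None:
        buy_widget = loyalty  # loyal customers are the ones who buy it
    ltv = 100 + 400 * loyalty + rng.normal(0, 50, n)
    return buy_widget, ltv

for name, world in [("door one", door_one), ("door two", door_two)]:
    buy, ltv = world()
    # Observationally, both worlds show the same large LTV gap
    # between widget buyers and non-buyers.
    gap = ltv[buy].mean() - ltv[~buy].mean()
    # The campaign is an intervention: it sets buy_widget for everyone.
    _, ltv_do = world(buy_widget=np.ones(n, dtype=bool))
    print(f"{name}: observed gap {gap:.0f}, mean LTV after campaign {ltv_do.mean():.0f}")
```

Behind door one the intervention lifts everyone’s LTV; behind door two it changes nothing, even though the observational data in the two worlds were identical.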
So, what can distinguish between the two graphs above? The title gives it away: qualitative research can. You know, the nitty-gritty of interviews, focus groups, ethnographic research, etc. Why does this help? It can answer simple questions like “why did you buy the diamond widget?” It can tell you what the causal mechanisms are between the variables listed above; in fact, it can help you understand what the “other things” even are to begin with, and you can start collecting that data too.
Using qualitative research to build the causal graph before doing quantitative analysis has not typically been the disposition of quantitative researchers, but I suspect this will change as new tools from the causal-inference revolution make it possible to formalize the output of qualitative research. The instinct has typically been to run an experiment, because with experiments you know how the data was generated. But experiments are costly, sometimes prohibitively so, and they provide a limited window into reality.
What would this new way of doing things look like? The first stage of a project would focus not on numbers, but on variables and relationships: the things that matter and how they’re woven together. The output of this qualitative phase would be formalized as a set of candidate causal graphs. One graph, representing the relationship between pre-game warmups and injuries, is below (from Shrier & Platt 2008, via DAGitty).
With causal graphs in hand, data collection and analysis can begin. With the data and the graph, causal effects can be estimated. It is no longer a situation of “no causes in, no causes out”, because you have established (or at least hypothesized) the arrows of causality through qualitative research.
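As a sketch of that last step, take the door-two world again, and suppose qualitative research has identified a (hypothetical) “loyalty” trait as the “other thing”, so it is now in the database. The graph tells you to adjust for it, and the adjusted estimate recovers the widget’s true effect, which in this simulated world is zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Door-two world, except that qualitative research has identified
# "loyalty" as the common cause and we now measure it.
loyalty = rng.random(n) < 0.5
buy_widget = rng.random(n) < np.where(loyalty, 0.8, 0.1)
ltv = 100 + 400 * loyalty + rng.normal(0, 50, n)

# Naive comparison (ignores the graph): a large apparent widget effect.
naive = ltv[buy_widget].mean() - ltv[~buy_widget].mean()

# Backdoor adjustment, dictated by the graph (loyalty -> widget,
# loyalty -> LTV): compare buyers and non-buyers within each loyalty
# stratum, then average over the strata (each has probability 0.5 here).
adjusted = np.mean([
    ltv[buy_widget & (loyalty == s)].mean()
    - ltv[~buy_widget & (loyalty == s)].mean()
    for s in (True, False)
])

# naive is large; adjusted is ~0, the widget's true causal effect here.
```

The arithmetic is nothing exotic; the point is that the *graph* is what licenses it. Without the hypothesized arrows, there is no principled way to know that loyalty is the thing to stratify on.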
I am not saying this is the only interface between qualitative and quantitative research, but it is an important and underdeveloped one. One, importantly, that has a formalized set of outputs (causal graphs) that can be consumed as inputs by quantitative research.
Seems a better title for your article would have been “Why data science needs theory”. A convenience sample of possible measures that come up in qualitative research might produce a better fit and feel satisfying, but I would recommend an approach that (at least partially) uses and cites theory: it is likely to produce better measurement, better models, and a more grounded understanding of what is really going on.