Following my post about the “three revolutions” in data science (deep learning, Bayesian inference, and causal inference), I thought it might be fun to write down where my work intersects with these “revolutions”. In the early drafts, I called this a “quick post” — but pretty soon it became anything but quick. So instead of one post, I split it into three. The first one is below, and over the next two weeks I’ll share the others.
We’ll start with the most visible of the revolutions: deep learning.
Bullseye
Deep learning has the smallest overlap with the work I do, but inside that overlap is an R&D project we’ve been working on for a bit over a year. The project has the codename “Bullseye”; at a high level, it uses deep learning to impute missing survey data.
Survey data imputation is boring and commonplace; what this technology does is take it to an entirely different scale. A typical survey has something like 100 questions, of which perhaps 10-20 may be missing answers. We’ve tested Bullseye on surveys with 5,000 questions, of which only 50 (for any given respondent) are observed, leaving 4,950 to be imputed.
While this isn’t close to the scale of a Netflix or Spotify, it’s closer to the task they face when recommending shows or new artists to you than it is to traditional survey imputation. Any given person has listened to only a tiny fraction of the songs in Spotify’s catalog, so only a tiny fraction of the data is “filled in”, and their systems need to guess what you will like among a vast range of artists you’ve never heard before. We’ve taken the same approach they use and applied it to filling in missing survey data.
Why does this work? Why would anyone do this? It works because survey data is highly correlated; far more correlated than can be appreciated through typical bivariate crosstabs, which are the water’s edge for most survey researchers. It turns out that once you have a few pieces of information about someone, predicting the rest is (often) a fairly straightforward task for deep networks.
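To make the recommender analogy concrete, here is a minimal sketch of the underlying idea using classic matrix completion (alternating least squares) on toy data. This is a stand-in for illustration only; Bullseye itself uses a deep network, and the data, dimensions, and hyperparameters below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "survey": 200 respondents x 30 questions, driven by 3 latent
# traits, so the columns are highly correlated (hypothetical data).
n_resp, n_q, rank = 200, 30, 3
answers = rng.normal(size=(n_resp, rank)) @ rng.normal(size=(rank, n_q))

# Observe only ~20% of the entries, mimicking sparse survey data.
mask = rng.random((n_resp, n_q)) < 0.2

# Alternating least squares: the same family of methods used in
# recommender systems, standing in here for a deep imputation model.
U = rng.normal(scale=0.1, size=(n_resp, rank))
V = rng.normal(scale=0.1, size=(n_q, rank))
lam = 1e-2  # ridge regularization keeps sparse rows/columns stable
for _ in range(30):
    for i in range(n_resp):
        obs = mask[i]
        A = V[obs].T @ V[obs] + lam * np.eye(rank)
        U[i] = np.linalg.solve(A, V[obs].T @ answers[i, obs])
    for j in range(n_q):
        obs = mask[:, j]
        A = U[obs].T @ U[obs] + lam * np.eye(rank)
        V[j] = np.linalg.solve(A, U[obs].T @ answers[obs, j])

# "Impute" every missing answer and score against the held-out truth.
imputed = U @ V.T
rmse = np.sqrt(np.mean((imputed[~mask] - answers[~mask]) ** 2))
print(f"RMSE on unobserved answers: {rmse:.3f}")
```

Because the columns are genuinely low-rank here, a handful of observed answers per respondent is enough to pin down the rest, which is the same structural bet Bullseye makes about real survey data.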
Why anyone would do this, we’re less sure about. We are working on applications now, and have sold exactly one (1) commercial project using this technology (which was a success).
For that project, we took about 50 survey questions available in the TGI Dataset (which our client was using to create audiences to target) and randomly showed each survey respondent 5 of them. We then imputed the remaining 45 on the backend using Bullseye. This allowed our client to match the data they were getting via survey to other audiences created with TGI data, without expending valuable survey real estate on the TGI questions (which, aside from being used for matching, they didn’t care about).
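The random-assignment step above is simple to sketch. Here is what "show each respondent 5 of the 50 questions" looks like as an observation mask (respondent counts and question IDs are hypothetical, not the client's actual data):

```python
import numpy as np

rng = np.random.default_rng(42)
n_respondents, n_questions, n_shown = 1000, 50, 5

# For each respondent, pick 5 of the 50 questions at random.
shown = np.array(
    [rng.choice(n_questions, size=n_shown, replace=False)
     for _ in range(n_respondents)]
)

# Build the observation mask an imputation model would consume:
# True where an answer was actually collected.
mask = np.zeros((n_respondents, n_questions), dtype=bool)
rows = np.repeat(np.arange(n_respondents), n_shown)
mask[rows, shown.ravel()] = True

# Each respondent answers exactly 5 questions; the other 45 are
# left for the model to impute on the backend.
assert (mask.sum(axis=1) == n_shown).all()

# Across 1,000 respondents, every question still gets broad
# coverage (~100 answers each in expectation).
print("per-question coverage:",
      mask.sum(axis=0).min(), "to", mask.sum(axis=0).max())
```

The point of the design is visible in the last line: no single respondent answers more than a sliver of the questionnaire, yet every question accumulates enough answers for the model to learn the correlations it needs.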
What else could we do? Some ideas we have are:
Working with clients to collect far more data per survey than they could before (e.g. if you were previously limited to 50 questions, with Bullseye we could collect data on 500-1,000 questions and impute the results)
Fusing multiple survey datasets into one megaset. Lots of our clients run many surveys on different themes and have no good way of merging them together; this technology can do that data fusion
Maintaining our own database of survey data (say, on things that would interest marketers who want to target specific audiences) and joining it onto surveys like concept tests, answering, say, the question of how to reach the 150 people in your 1000N survey who were “very interested” in your widget
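The fusion idea is worth making concrete. Two surveys that share a few common questions but otherwise ask different things can be stacked into one sparse "megaset", where the unasked questions simply show up as missing values, exactly the missingness an imputation model is built to fill in. The surveys, column names, and values below are hypothetical:

```python
import pandas as pd

# Two hypothetical surveys on different themes, sharing demographic
# questions but otherwise asking different things.
survey_a = pd.DataFrame({
    "age": [34, 51, 28],
    "region": ["N", "S", "W"],
    "buys_widgets": [1, 0, 1],        # asked only in survey A
})
survey_b = pd.DataFrame({
    "age": [45, 22],
    "region": ["E", "N"],
    "watches_streaming": [1, 1],      # asked only in survey B
})

# Stack them into one megaset: pandas fills unasked questions with
# NaN, turning "two surveys" into "one survey with missing data"
# that an imputation model can complete.
megaset = pd.concat([survey_a, survey_b], ignore_index=True, sort=False)
print(megaset)
```

After imputation, every respondent would carry an (estimated) answer to every question from both surveys, which is what makes the merged dataset usable as a single audience-building resource.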
If you have ideas for us or would like to conduct a pilot project using this technology, please get in touch.