Building off the post I wrote about the “three revolutions” in data science happening right now (deep learning, Bayesian inference, and causal inference), I thought it might be fun to write down where my work intersects with these “revolutions”. In the original drafts of this, I called this a “quick post”, but pretty soon it became anything but quick. So instead of one post, I split it into three. The first one was last week, the second one is below, and I’ll share the last one next week.
Unlike deep learning, where there was a single application I could point to, it’s harder to pick one example of where we use Bayesian inference, because it’s woven into more of the things that I do.
This discussion, which turned into this model, is illustrative: it’s in Stan (a language for probabilistic programming), it’s collaborative, and it’s a challenging model for marketing applications.
It’s exactly the type of model that, in the past, you would have had to wait for smart academics to work out the estimation procedure for, and for someone else to write a software package. With Stan, PyMC3, Julia, etc., even amateurs like me can use (and extend, and remix) sophisticated models like these. And it’s even easier with the community that has sprung up around them, as this case illustrates. Daniel Guhl, a really smart guy I’ve never met, helped me implement a latent class multinomial logit over the Internet. Awesome experience.
What is a “latent class multinomial logit” model? Well, let’s say that you want to estimate how much your customers would be willing to pay for an additional feature. You could just ask them, but people are notoriously bad at answering these questions. Say you’re buying a car and you’re asked “how much would you pay to add lane departure assist?”. Most people really struggle to answer this type of question in the abstract and in isolation, much less to give accurate answers.
A better method (not perfect, but much, much better) is to give people hypothetical choices. You give people the choice between, say, a Toyota Camry with lane departure assist at $25k and a Subaru Outback without lane departure assist at $30k. With enough choices, you can regress the respondents’ choices on things like brand (Toyota vs. Subaru), type (sedan vs. SUV), features (lane assist vs. not), and price ($25k vs. $30k). By taking the ratio of a feature’s coefficient to the price coefficient, you can estimate “willingness to pay” for that feature. This is the “multinomial logit” part of the model. In market research we call these conjoint experiments.
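To make the mechanics concrete, here’s a minimal sketch of that calculation. The coefficients are made-up numbers for illustration, not estimates from any real conjoint study:

```python
import numpy as np

# Hypothetical part-worth coefficients, as if estimated from a conjoint study.
# (Illustrative numbers only, not from a real fit.)
beta = {
    "brand_toyota": 0.4,   # utility of Toyota relative to Subaru
    "type_sedan": -0.2,    # utility of a sedan relative to an SUV
    "lane_assist": 0.6,    # utility of lane departure assist
    "price_per_k": -0.3,   # utility per $1k of price (negative: cheaper is better)
}

def utility(is_toyota, is_sedan, has_lane_assist, price_k):
    return (beta["brand_toyota"] * is_toyota
            + beta["type_sedan"] * is_sedan
            + beta["lane_assist"] * has_lane_assist
            + beta["price_per_k"] * price_k)

# The two hypothetical alternatives from the example above.
u_camry = utility(1, 1, 1, 25)    # Toyota Camry with lane assist at $25k
u_outback = utility(0, 0, 0, 30)  # Subaru Outback without lane assist at $30k

# Multinomial logit: choice probabilities are a softmax over utilities.
utils = np.array([u_camry, u_outback])
probs = np.exp(utils) / np.exp(utils).sum()

# Willingness to pay for a feature = its coefficient divided by
# the negative of the price coefficient, in the price's units ($k here).
wtp_lane_assist_k = beta["lane_assist"] / -beta["price_per_k"]
print(probs, wtp_lane_assist_k)  # WTP works out to $2k for lane assist
```

With these made-up coefficients, the model says this respondent would pay about $2k extra for lane departure assist, and the softmax turns the utility gap between the two cars into a choice probability.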
But it’s never the case that everyone in the market is willing to pay the same amount for every feature, and often you want to figure out which combinations of features are particularly attractive to particular segments of the market. With multiple products you can capture more of the overall market than with a single product. The iPhone has five major versions, and each version has different configuration options. My guess is that without its cheaper options, Apple would lose significant share to Android.
But the upside of offering multiple products comes with a major downside: cannibalization. If you make your cheaper version too attractive, consumers who would have paid for the pricier option will buy the cheaper one instead. Nailing this tradeoff requires fancy optimization routines on top of Bayesian inference. We’ll save the optimization side for another post.
So let’s say you’ve captured data from a few hundred or a few thousand respondents on the hypothetical choices they would make. The “multinomial logit” part of the model estimates how much respondents are willing to pay for each feature, and the “latent class” part identifies who exactly is willing to pay for what.
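The way those two parts fit together is that each respondent’s choice sequence gets a likelihood under each class’s coefficients, and the classes are mixed by their membership probabilities. The actual model lives in Stan, but here’s a NumPy sketch of that likelihood (all names and the tiny example data are hypothetical):

```python
import numpy as np

def latent_class_mnl_loglik(X, choices, betas, class_probs):
    """Log-likelihood of one respondent's choices under a latent class MNL.

    X: array (tasks, alternatives, features) of attribute levels
    choices: array (tasks,) of chosen-alternative indices
    betas: array (classes, features) of class-specific coefficients
    class_probs: array (classes,) of class membership probabilities
    """
    per_class = []
    for beta in betas:
        utils = X @ beta                           # (tasks, alternatives)
        utils -= utils.max(axis=1, keepdims=True)  # numerical stability
        logp = utils - np.log(np.exp(utils).sum(axis=1, keepdims=True))
        # Log-probability of this respondent's whole choice sequence
        # if they belonged to this class.
        per_class.append(logp[np.arange(len(choices)), choices].sum())
    # Mix over classes: log sum_k pi_k * P(choices | class k),
    # computed with the log-sum-exp trick.
    per_class = np.array(per_class)
    m = per_class.max()
    return m + np.log((class_probs * np.exp(per_class - m)).sum())

# Tiny made-up example: 2 choice tasks, 2 alternatives, 1 attribute (price in $k).
X = np.array([[[25.0], [30.0]],
              [[28.0], [22.0]]])
choices = np.array([0, 1])                 # picked the cheaper car both times
betas = np.array([[-0.3], [-0.05]])        # price-sensitive vs. insensitive class
class_probs = np.array([0.6, 0.4])         # hypothetical class shares
ll = latent_class_mnl_loglik(X, choices, betas, class_probs)
```

Summing each log-likelihood over all respondents gives the quantity the sampler (or an EM routine) maximizes; in Stan the same marginalization over classes is done with `log_sum_exp`.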
Latent class models assume that the single dataset you have is actually a blend of several different datasets, each with its own characteristics. For example, imagine you were given a dataset of people’s heights, but the person recording them happened to collect each height from either a three-year-old or a 25-year-old. After plotting the data, it’d be clear that it was actually two different datasets merged together. With real-life datasets, it’s never that clear.
Even if it’s less obvious than the example above, statistics can “unblend” your data, and that’s exactly what latent class models do. After you’ve run your data through a model like this, you’ll have results along the lines of “33% of people are in the first segment, they’re more likely to be old than young, they’re willing to pay $10 for a widget but only $2 for a wodget, etc., etc.”
This is one of the most common projects that we do at Gradient, and it’s been made possible by reliable, modern, flexible, and well-supported Bayesian computation platforms.