Bayesian Media Mix Models: Modelling changes in marketing effectiveness over time


Benjamin Vincent



Here at PyMC Labs we've been working with one of the teams of data scientists at HelloFresh. This team is doing great work converting marketing dollars to newly acquired customers. They had developed their own Bayesian Media Mix Model which they described in a talk at PyMCCon 2020, but they recently challenged us to see what we could do with our expertise and experience with PyMC.


In the first blog post in this series, we outlined what Bayesian Media Mix Models (MMMs) are, how they work, and what insights they can provide. In the second blog post in the series we summarised a range of improvements and deliverables. In this third installment we highlight a couple of major advances we’ve made that take MMMs to the next level. Also make sure to check out Thomas Wiecki's PyData 2022 talk (Solving Real-World Business Problems with Bayesian Modeling) which covers a lot of the ideas in this series of MMM blog posts.

Time varying channel effectiveness

As a reminder from the second blog post, our saturation function maps a level of spend on a media channel to a number of new customers and is shown in the plot below. One of the nice features of our saturation function is that the parameters are interpretable. The $cac_0$ parameter is the inverse of the slope at the origin. It describes the customer acquisition cost of gaining 1 customer from an initial spend of zero, hence $cac_0$. We also have a parameter $S$ which describes the maximum number of users at channel saturation. These parameters have separate and independent effects upon the slope and the saturation points, which results in nice sampling properties, avoiding excessive parameter correlations that can occur with other saturation functions. An example with a given set of parameters is shown below.

Schematic of the $\tanh$ saturation function with $cac_0=0.5$, and channel saturation of $S=50$. As the $cac_0$ increases (not shown), the slope would get shallower, reflecting that it costs more to acquire new customers.
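
As a minimal sketch, the saturation function can be written in a few lines of NumPy. The functional form $S \cdot \tanh(x / (S \cdot cac_0))$ is the one used in the generative process described in this post; the function name `tanh_saturation` and the example spend values are illustrative choices, not part of the delivered model.

```python
import numpy as np


def tanh_saturation(spend, cac0, saturation):
    """Expected new customers for a given spend on one channel.

    The slope at zero spend is 1 / cac0, and the curve flattens out at
    `saturation` customers as spend grows.
    """
    spend = np.asarray(spend, dtype=float)
    return saturation * np.tanh(spend / (saturation * cac0))


# Parameters from the schematic: cac0 = 0.5, S = 50.
spend = np.array([0.0, 1.0, 10.0, 1000.0])  # illustrative spend levels
customers = tanh_saturation(spend, cac0=0.5, saturation=50.0)

# Near the origin each unit of spend buys roughly 1 / cac0 = 2 customers;
# at very high spend the channel saturates at S = 50 customers.
print(customers)
```

Because $cac_0$ only controls the initial slope and $S$ only controls the ceiling, changing one leaves the other's interpretation intact.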

While this is great, it does make the simplifying assumption that the world is unchanging - that if you spent the same amount on the channel, the number of new customers gained through that channel would remain the same.

While this is a useful simplifying assumption made in all the MMMs we have seen, we would be very lucky if the real world were so simple! The effectiveness of different media channels will likely change over time - consumer exposure to different channels and the effect of advertisements are both likely to fluctuate. For example, Covid-19 lockdowns radically changed people's commuting and work behaviours, which would dramatically shift the effectiveness of different channels upwards or downwards. Failing to take the dynamic nature of the world into account would lead to biased estimates in our MMM and likely lead to suboptimal marketing decisions being made.

A Generative Process

Let's now describe what the underlying generative process of a time-varying CAC might look like. Bearing in mind that this is a simplified 1-channel model, we would describe our expected number of customers with our $\tanh$ saturation function:

$$ y[t] = S \cdot\tanh(x[t]/(S \cdot cac_0[t])) $$

Note how $cac_0[t]$ now also depends on time $t$, rather than being constant. It would be possible to model market saturation $S$ as changing over time as well, or instead of $cac_0$, but this is not the direction we took.

The schematic figure below visualizes this proposed data generating process for a single media channel. We have an underlying channel effectiveness (inverse CAC) which in real data analysis situations is unobserved (top left). The saturation function (right) relates the dollar spend on this channel to the number of customers from this channel.

A schematic plot of a putative data generating process for a single media channel. This channel has an unobserved and changing effectiveness over time (top left). The customers acquired from this channel (bottom left) are determined by our saturation function (right) and the spend on that channel (middle left). The saturation point was fixed at $S=100$ customers for this simulation.

You can see that spend on the channel fluctuates quite a lot, moving us up and down the saturation function. But on a slower timescale, $cac_0$ changes, which acts to shift the initial slope of the saturation function.
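
A data generating process of this kind can be simulated roughly as follows. This is only a sketch under stated assumptions: the weekly time grid, the sinusoidal drift in $cac_0[t]$, the uniform spend, and the Poisson observation noise are all hypothetical choices made for illustration. Only the saturation equation and $S=100$ come from the description above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_weeks = 104  # assumed: two years of weekly data
t = np.arange(n_weeks)
saturation = 100.0  # S, fixed as in the schematic

# Hypothetical slowly drifting cac0: a smooth sinusoid between 0.5 and 1.5.
cac0 = 1.0 + 0.5 * np.sin(2 * np.pi * t / n_weeks)

# Hypothetical spend that fluctuates quickly from week to week.
spend = rng.uniform(low=10.0, high=150.0, size=n_weeks)

# Expected customers from the saturation function, with cac0 indexed by time.
expected_customers = saturation * np.tanh(spend / (saturation * cac0))

# Observed customers: Poisson noise around the expectation (an assumption).
customers = rng.poisson(expected_customers)
```

In a real analysis only `spend` and `customers` would be observed; `cac0` is the hidden quantity we want to recover.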

Gaussian processes

The equation above shows the data generating process, but when we analyze real marketing data we do not know how the channel effectiveness changes over time. Instead, our job is to infer this time-varying quantity $cac_0[t]$, and we do this using Gaussian Processes in PyMC. In brief, Gaussian Processes allow for scalable and flexible modeling of temporal dependencies.

The figure below shows a simplified example where we use the observed new customers and spends to infer $cac_0[t]$. We can see that in this example where the true $cac_0[t]$ is known (because we simulated the data) we can do a good job of inferring it from the spend and new customer numbers.

Schematic using Gaussian Processes to model $cac_0[t]$ based upon observed spends and new customers and the $\tanh$ saturation function. The dashed line shows the ground truth used to generate simulated data, and the red shows the inferred $cac_0[t]$.
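
To give some intuition for what a Gaussian Process prior over $cac_0[t]$ looks like, here is a minimal NumPy sketch that draws smooth trajectories from one. It is not the fitted PyMC model and performs no inference. The exponentiated quadratic kernel, the length-scale and amplitude values, the prior median, and the choice to place the process on $\log cac_0[t]$ (so that $cac_0[t]$ stays positive) are all assumptions made for illustration.

```python
import numpy as np


def exp_quad_kernel(t1, t2, amplitude, length_scale):
    """Exponentiated quadratic covariance between two sets of time points."""
    diff = t1[:, None] - t2[None, :]
    return amplitude**2 * np.exp(-0.5 * (diff / length_scale) ** 2)


rng = np.random.default_rng(seed=1)
t = np.arange(104, dtype=float)  # assumed weekly time index

# Covariance of the latent function; jitter keeps the Cholesky factor stable.
cov = exp_quad_kernel(t, t, amplitude=0.3, length_scale=20.0)
cov += 1e-6 * np.eye(len(t))
chol = np.linalg.cholesky(cov)

# Draw three latent functions f ~ GP(0, k), one per column.
f = chol @ rng.standard_normal((len(t), 3))

# Map to positive cac0 trajectories around an assumed prior median of 1.0.
prior_median = 1.0
cac0_draws = prior_median * np.exp(f)
```

The length-scale controls how quickly $cac_0[t]$ is allowed to change: a larger value encodes the belief that effectiveness drifts slowly relative to week-to-week fluctuations in spend.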

Hierarchical Gaussian Processes

One approach would be to use Gaussian Processes to model the changing market effectiveness of each channel, independently. But this would fail to take into account our belief that there are larger scale temporal processes which can affect all channels. For example, seasonal effects as well as the dramatic changes in daily life caused by Covid-19 lockdown measures will clearly mean that marketing effectiveness is not totally independent across channels. How should we build this structural knowledge into our model?

Hierarchical Gaussian Processes to the rescue!

A schematic of a hierarchical Gaussian Process. The top panel shows our global time-varying process of channel effectiveness. Bottom panels show advertising channel specific deflections from the global time-varying process.

This is similar to hierarchical modeling in a linear regression context. If we only had one point in time, we could imagine a group-level effect (e.g. an intercept), with different channels having deflections from that intercept.

But using hierarchical Gaussian Processes we are able to model a global-level time varying channel effectiveness with the temporal correlations provided from Gaussian Processes, and channel-level Gaussian Processes to model any differences in effectiveness across different channels.
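
The additive structure of a hierarchical Gaussian Process can be sketched in NumPy as a shared global draw plus one smaller deflection per channel. This illustrates only the prior structure, not the delivered model: the channel names, kernel, amplitudes, length-scales, and the use of a log scale for $cac_0$ are hypothetical.

```python
import numpy as np


def exp_quad_cov(t, amplitude, length_scale):
    """Exponentiated quadratic covariance matrix over time points `t`."""
    diff = t[:, None] - t[None, :]
    cov = amplitude**2 * np.exp(-0.5 * (diff / length_scale) ** 2)
    return cov + 1e-6 * np.eye(len(t))  # jitter for numerical stability


rng = np.random.default_rng(seed=2)
t = np.arange(104, dtype=float)  # assumed weekly time index
channels = ["tv", "radio", "social"]  # hypothetical channel names

# Global process shared by every channel.
global_chol = np.linalg.cholesky(exp_quad_cov(t, amplitude=0.4, length_scale=25.0))
f_global = global_chol @ rng.standard_normal(len(t))

# Channel-level deflections, with a smaller amplitude than the global process.
channel_chol = np.linalg.cholesky(exp_quad_cov(t, amplitude=0.15, length_scale=15.0))
f_channel = channel_chol @ rng.standard_normal((len(t), len(channels)))

# Each channel's latent curve is the global process plus its own deflection.
f_total = f_global[:, None] + f_channel
cac0 = np.exp(f_total)  # shape (time, channel), positive by construction
```

Giving the deflections a smaller amplitude than the global process expresses the belief that channels mostly move together, while still letting an individual channel depart from the shared trend when the data support it.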


Implementing a vanilla Bayesian MMM is a doable task for a well-equipped data scientist. But when we think about it, the kinds of inferences we are trying to make with MMMs are actually very challenging. So many things can influence customer acquisitions, and we have to try to disentangle the influence of advertising from that of background processes. So while it is possible to implement a Bayesian MMM, it is crucial to be quite skeptical of your model and really test whether the inferences it makes are meaningful.

We found that ‘regular’ MMMs were predicting CAC values which were way too high, which indicated a problem in terms of which parameters or predictors in the model were accounting for customer acquisition. This overestimation of CACs has dire consequences for the business decisions made based on the model results. Marketing attribution is less effective and overall we would assume that our marketing is less effective than it actually is.

This is what motivated our extensive exploration away from vanilla Bayesian MMMs and towards alternative models. The final Bayesian MMM we delivered is state of the art in terms of model structure as well as out-of-sample accuracy. While we have made many improvements, the core contribution that we have discussed above is using Gaussian Processes to model CAC (in the saturation function) as a time-varying process. There are many reasons to suspect that CAC may change over time - incorporating this into an MMM could lead to better modelling of the true data generating process. But this does of course result in greater model complexity, and care must be taken to check that your data are informative about these extra model parameters.

We have taken a relatively high-level overview in this post; there are many more implementation details and experimentation steps which we glossed over. But for those who are hungry for even more detail, keep an eye on our Twitter profile and blog posts for any future updates.

Work with PyMC Labs

If you are interested in seeing what we at PyMC Labs can do for you, then please email us. We work with companies at a variety of scales and with varying levels of existing modeling capacity. We also run corporate workshop training events and can provide sessions ranging from introduction to Bayes to more advanced topics.