December 05, 2025
By Teemu Säilynoja, Luca Fiaschi
A few weeks ago we released a benchmark comparing PyMC-Marketing and Google Meridian on synthetic but realistic media mix modeling datasets (here). That study showed clear, quantitative differences: PyMC-Marketing achieved lower contribution-recovery error, converged more reliably, and scaled more efficiently across dataset sizes.
Both libraries have since evolved. This follow-up evaluates the latest releases, PyMC-Marketing v0.17.0 and Meridian v1.2.1, focusing on how Meridian’s new Automated Knot Selection affects predictive performance and recovery of media contributions. The datasets, data-generating process, and experimental conditions remain identical to the original study.
The central question is simple:
Does Meridian’s automated spline baseline reduce the performance gap, or do the structural advantages of PyMC-Marketing’s explicit seasonality modeling remain?
Meridian now includes an automatic procedure for selecting spline knot counts in its time-varying baseline.
This update speaks to some of the open questions we identified in the first benchmark; in this article, we investigate whether it also improves contribution recovery.
We updated PyMC-Marketing to use the current recommended hierarchical priors for saturation parameters, aligning with the official example notebooks. These represent the standard starting point for multidimensional MMMs.
As suggested by some readers of our previous article, we removed spend-share-informed priors from both libraries for this benchmark.
These priors are practical in real-world MMM, but in synthetic datasets historical spend does not reflect true effectiveness. Using neutral priors isolates each library’s ability to recover effects from data alone.
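To make this concrete, here is a minimal sketch of what such a configuration can look like, assuming the nested `Prior` pattern and `MMM` interface from recent PyMC-Marketing releases; the channel names and hyperparameter values are placeholders, not the benchmark’s exact settings.

```python
from pymc_marketing.mmm import MMM, GeometricAdstock, LogisticSaturation
from pymc_marketing.prior import Prior

# Hierarchical saturation priors: channel-level parameters are drawn around
# shared population-level hyperpriors. Nothing here encodes spend share.
saturation = LogisticSaturation(
    priors={
        "lam": Prior(
            "Gamma",
            mu=Prior("HalfNormal", sigma=1),     # population-level location
            sigma=Prior("HalfNormal", sigma=1),  # population-level spread
            dims="channel",
        ),
        "beta": Prior("HalfNormal", sigma=1, dims="channel"),
    }
)

mmm = MMM(
    date_column="date",
    channel_columns=["tv", "search", "social"],  # placeholder channels
    adstock=GeometricAdstock(l_max=8),
    saturation=saturation,
)
```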
All datasets, noise structures, and true channel contributions are unchanged from the original study. The data generation code is now published as a standalone module to make replication easier, and the entire benchmark has been open sourced for reproducibility here.
The most important methodological difference remains unchanged: Meridian absorbs trend and seasonality into a flexible spline-based baseline, while PyMC-Marketing models seasonality explicitly with Fourier terms.
This distinction is not cosmetic. It determines how much structure is specified explicitly versus left for the model to infer, and how easily baseline variation can leak into media coefficients.
The new feature, Automated Knot Selection, increases Meridian’s flexibility by introducing more splines and allowing its baseline to match observed data more closely. The scientific question is whether this flexibility now allows it to separate media effects as effectively as PyMC-Marketing’s explicit seasonal structure.
During testing, we fit Meridian with its log-normal and normal media-effect priors.
For fairness, we used the best-converged Meridian configuration for each dataset in all final comparisons.
This is a practical point for users: Meridian’s default prior choice may require adjustment for stability. To keep the comparison as fair as possible, every adjustment we made to Meridian’s defaults was one that improved its results, so we report the best numbers we could obtain for it.
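For readers reproducing this, both knobs live on Meridian’s `ModelSpec`. The sketch below reflects our reading of the API; check the v1.2.1 documentation for the exact semantics of automated knot selection.

```python
from meridian.model import spec

# Media-effect prior family: 'log_normal' is Meridian's default; we also
# tested 'normal', which converged more reliably on some of our datasets.
model_spec = spec.ModelSpec(
    media_effects_dist="normal",
    # We understand knots=None to defer knot placement to the library,
    # while an integer count (or explicit positions) overrides it manually.
    knots=None,
)
```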
We examine three dimensions: sampling efficiency, in-sample goodness of fit, and contribution recovery. First, sampling efficiency across the three dataset sizes (higher values indicate more efficient sampling):
| Sampler | Start-up | Scale-up | Mature |
|---|---|---|---|
| PyMC-Marketing - nutpie | 9.38 | 1.59 | 1.71 |
| PyMC-Marketing - default | 13.23 | 0.69 | 1.13 |
| PyMC-Marketing - numpyro | 28.82 | 1.32 | Sampling failed |
| PyMC-Marketing - blackjax | 27.19 | 1.74 | Sampling failed |
| Meridian - tensorflow | 2.28 | 0.99 | 1.10 |
| Meridian - tensorflow (log-normal) | 2.19 | 0.07 | 0.76 |
Across small and mid-sized datasets, both libraries converged. But PyMC-Marketing’s samplers — especially NumPyro and BlackJAX — remained markedly more efficient.
PyMC-Marketing’s ability to switch between backends remains a practical advantage: teams can choose the sampler that best matches their dataset size and compute environment.
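Switching is a one-argument change at fit time. A sketch, assuming an `MMM` instance `mmm` (like the one above) and training data `X`, `y`:

```python
# Same model, different NUTS implementations; pick per dataset size.
idata = mmm.fit(X, y)                          # PyMC's default sampler
idata = mmm.fit(X, y, nuts_sampler="nutpie")   # strong at larger scales
idata = mmm.fit(X, y, nuts_sampler="numpyro")  # fast on smaller datasets
```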
Also note that we did not rerun the enterprise dataset: as reported in the previous article, Meridian does not converge at that scale at all.
Automated Knot Selection significantly improved Meridian's in-sample predictive accuracy.
Because the automated routine tends to select a large number of knots, the baseline can track residual structure closely. On small and growing datasets, Meridian matched or exceeded PyMC-Marketing’s in-sample accuracy.
However, this flexibility comes with a caveat. Meridian’s residuals show stronger autocorrelation, reflected in lower Durbin–Watson statistics, meaning the baseline is absorbing short-term patterns rather than modeling seasonality explicitly. This can inflate in-sample fit while hurting out-of-sample performance, since forecasts rely on structure the model has not truly learned. We have not yet tested out-of-sample performance in this benchmark; future versions will include those results. In contrast, PyMC-Marketing’s explicit Fourier seasonality leaves residuals closer to white noise, indicating cleaner separation of trend, seasonality, and media effects.
| Metric | Library | Small Business | Growing Business | Medium Business |
|---|---|---|---|---|
| R² | PyMC-Marketing | 0.871 ± 0.020 | 0.884 ± 0.015 | 0.954 ± 0.006 |
| R² | Meridian | 0.930 ± 0.009 | 0.941 ± 0.004 | 0.968 ± 0.002 |
| MAPE (%) | PyMC-Marketing | 7.10 ± 0.60 | 6.70 ± 0.50 | 5.00 ± 0.30 |
| MAPE (%) | Meridian | 5.10 ± 0.40 | 4.70 ± 0.20 | 4.20 ± 0.20 |
| Durbin–Watson | PyMC-Marketing | 1.97 ± 0.17 | 1.85 ± 0.15 | 1.88 ± 0.14 |
| Durbin–Watson | Meridian | 1.89 ± 0.17 | 1.67 ± 0.09 | 1.42 ± 0.06 |
| CRPS | PyMC-Marketing | 267 ± 154 | 301 ± 188 | 5568 ± 3608 |
| CRPS | Meridian | 237 ± 215 | 325 ± 300 | 7488 ± 7109 |
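The Durbin–Watson diagnostic is easy to reproduce on any fitted model: values near 2 indicate residuals close to white noise, while values well below 2 signal positive autocorrelation. A minimal sketch, assuming `y_obs` and `y_pred` are time-ordered observed and posterior-mean predicted sales:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

residuals = np.asarray(y_obs) - np.asarray(y_pred)  # must be ordered in time
dw = durbin_watson(residuals)  # ~2.0: white noise; < 2.0: positive autocorrelation
print(f"Durbin-Watson: {dw:.2f}")
```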
Despite improved predictive fit, Meridian still underperforms PyMC-Marketing in contribution recovery.
This aligns with our observations from the original benchmark: a flexible spline baseline can improve in-sample accuracy while degrading causal separation between baseline and media effects.
| Metric | Library | Small Business | Growing Business | Medium Business |
|---|---|---|---|---|
| Bias | PyMC-Marketing | 82 ± 207 | 56 ± 158 | 522 ± 2175 |
| Bias | Meridian | 219 ± 259 | 49 ± 185 | 1664 ± 3317 |
| SRMSE | PyMC-Marketing | 0.41 ± 0.23 | 0.42 ± 0.34 | 0.16 ± 0.09 |
| SRMSE | Meridian | 0.66 ± 0.24 | 0.54 ± 0.49 | 0.29 ± 0.24 |
| CRPS | PyMC-Marketing | 145.39 ± 171.28 | 95.75 ± 139.13 | 1287.46 ± 1796.76 |
| CRPS | Meridian | 229.25 ± 311.69 | 97.54 ± 109.28 | 2265.50 ± 3050.04 |
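For reference, the sketch below shows how we understand these recovery metrics; the exact definitions (in particular the normalization used in SRMSE) are assumptions on our part, and the authoritative versions live in the open-sourced benchmark scripts. Here `true_contrib` is the known per-period contribution of a channel and `est_contrib` holds posterior draws of shape `(n_draws, n_periods)`.

```python
import numpy as np
import properscoring as ps  # pip install properscoring

def bias(true_contrib, est_contrib):
    """Mean signed error of the posterior-mean contribution."""
    return np.mean(est_contrib.mean(axis=0) - true_contrib)

def srmse(true_contrib, est_contrib):
    """RMSE standardized by the mean true contribution (assumed normalization)."""
    err = est_contrib.mean(axis=0) - true_contrib
    return np.sqrt(np.mean(err**2)) / np.mean(true_contrib)

def crps(true_contrib, est_contrib):
    """Average CRPS, treating posterior draws as an ensemble forecast."""
    return ps.crps_ensemble(true_contrib, est_contrib.T).mean()  # draws on last axis
```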

When we examined contribution-recovery errors more closely, a consistent pattern emerged: channels that saturate quickly are inherently difficult to estimate. Practitioners often focus on ensuring enough variation in spend, but the key nuance is that spend variation must translate into variation in contribution. This only happens when spend moves within the effective (non-saturated) region of the response curve.
Once a channel is saturated, additional spend produces roughly the same outcome, breaking the correlation between spend and sales. If most of the observed variation occurs in this plateau, the model cannot distinguish the channel’s effect from the baseline, even if spend fluctuates. Without enough periods below the saturation point, the channel’s contribution becomes confounded with the intercept.
This identifiability issue makes early-saturating channels systematically harder to recover for any MMM approach. It also highlights an important practical limitation: accurate attribution requires sufficient variation in the part of the response curve where spend changes actually produce different incremental contributions. If anything, a more flexible baseline makes the identifiability problem more pronounced.
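To make the mechanism concrete, the toy simulation below (our own illustration, not benchmark code) shows how the spend–sales correlation collapses once spend operates on the plateau of a Hill-type saturation curve:

```python
import numpy as np

rng = np.random.default_rng(42)

def hill(spend, half_sat=100.0, slope=2.0):
    """Hill-type saturation: incremental effect vanishes once spend >> half_sat."""
    return spend**slope / (spend**slope + half_sat**slope)

n = 104                              # two years of weekly observations
noise = rng.normal(0, 0.05, size=n)  # baseline noise in the outcome

# Same relative spend variation, two operating regions of the curve.
spend_effective = rng.uniform(20, 180, size=n)    # steep, identifiable region
spend_saturated = rng.uniform(800, 4000, size=n)  # deep in the plateau

for label, spend in [("effective", spend_effective), ("saturated", spend_saturated)]:
    sales = 1.0 + hill(spend) + noise  # baseline + channel contribution + noise
    r = np.corrcoef(spend, sales)[0, 1]
    print(f"{label} region: corr(spend, sales) = {r:.2f}")
```

In the saturated regime, the channel’s tiny contribution variance is swamped by baseline noise, which is exactly the confounding with the intercept described above.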
The updated benchmark highlights a more nuanced comparison between the two libraries as they continue to evolve.
In terms of sampling efficiency, PyMC-Marketing still holds a clear advantage. Across all dataset sizes, there is always at least one PyMC sampler that outperforms Meridian, with NumPyro and BlackJAX delivering 10–20x efficiency gains on smaller datasets and Nutpie providing strong performance at larger scales.
For in-sample goodness-of-fit, Meridian benefits meaningfully from automated knot selection. The more flexible baseline allows it to track observed sales more closely, sometimes matching or exceeding PyMC-Marketing’s predictive accuracy. However, the lower Durbin–Watson statistics indicate stronger residual autocorrelation, suggesting the baseline is absorbing short-term variation rather than modeling seasonality explicitly, which can be problematic in some scenarios.
The key trade-off emerges in contribution recovery. PyMC-Marketing continues to outperform Meridian across all three dataset sizes (small, growing, and medium). The gains in Meridian’s predictive fit do not appear to come from improved causal separation, but rather from the spline baseline capturing variation that should be attributed to media channels. In other words, the improved in-sample accuracy likely reflects overfitting at the baseline level.
For practitioners whose goal is reliable attribution rather than pure predictive accuracy, PyMC-Marketing’s explicit Fourier-based seasonality remains the more robust choice, offering clearer separation between trend, seasonality, and media effects.
Both libraries are maturing quickly, and each update brings meaningful improvements to the open-source MMM ecosystem. We encourage practitioners to experiment with the latest versions, replicate these results using the published data-generation module, and contribute to the ongoing refinement of these tools.
Curious to explore the models yourself? Everything (from the data generator to the full benchmark scripts) is available and open sourced for reproducibility here. And if you’re considering an MMM overhaul or want guidance on the right modeling approach for your business, reach out. We love helping teams build measurement systems they can trust.