Implementing these can be simpler than you think
There seems to be a general misconception that Bayesian methods are harder to implement than Frequentist ones. Sometimes this is true, but more often existing R and Python libraries can help simplify the process.
Simpler to implement ≠ throw in some data and see what sticks. (We already have machine learning for that. :P)
People make Bayesian methods sound more complex than they are, mostly because there’s a lot of jargon involved (e.g. weak priors, posterior predictive distributions, etc.) which isn’t intuitive unless you have previously worked with these methods.
Adding more to that misconception is a clash of ideologies — “Frequentist vs. Bayesian.” (If you weren’t aware, well, now you know.) The problem is people, especially statisticians, commonly quarrel over which methods are more powerful when in fact the answer is “it depends.”
Bayesian methods, like any others, are just tools at our disposal. They have advantages and disadvantages.
So, with some personal “hot takes” out of the way, let’s move on to the fun stuff and implement some Bayesian linear mixed (LMM) models!
Here’s what I’ll cover (both in R and Python):
- Practical methods to select priors (needed to define a Bayesian model)
- A step-by-step guide on how to implement a Bayesian LMM using R and Python (with
- Quick model diagnostics to help you catch potential problems early on in the process
Bayesian model comparison/evaluation methods aren’t covered in this article. (There are more ways to evaluate a model than RMSE.) I’ll publish a subsequent article covering these in more detail.
If you are unfamiliar with mixed models I recommend you first review some foundations covered here. Similarly, if you’re not very familiar with Bayesian inference I recommend Aerin Kim’s amazing article before moving forward.
Let’s just dive back into the marketing example I covered in my previous post. Briefly, our dataset is composed of simulated website bounce times (i.e. the length of time customers spend on a website), and the overall goal was to find out whether younger people spent more time on a website than older ones.
The dataset has 613 observed “bounce times” (
bounce_time, secs) collected across 8 locations (
county), each with an associated
age. However, the number of observations per location varies (e.g. one has 150 observations while another has only 7).
EDA is the unsung hero of data analysis and you shouldn’t think of it as “just plotting data.”
(I’d stay away from posts implying it’s the “boring stuff” or those just looking to automate it so you can dive straight into modeling, even if you’re using an ML algorithm. If you’re doing so, you’re missing out on a very powerful tool.) Personally, I think it is one of the most crucial steps in your analytics workflow. It can work as an iterative tool to help you adapt your modeling strategy, especially when your strategy is to start simple and build-up to models with higher complexity.
EDA is more than just plotting data, it acts as an iterative tool that can help you adapt your modeling strategy.
For example, in our case the simplest model we can fit is a basic linear regression using
sklearn (Python) or
lm (R), and see how well it captures the variability in our data.
We could also consider a more complex model such as a linear mixed effects model. Again with some EDA we see that such a model captures group variability better and thus might be a better strategy. We can use the
seaborn.lmplot or ggplot2’s
geom_smooth to quickly build some intuitive EDA plots. Here it seems that a varying-intercept, and a varying-intercept / varying-slope model might be good candidates to explore. (Our EDA also brings to light an example of Simpson’s Paradox, where a global regression line doesn’t match the individual county trends.)
We’ll proceed with three candidate models: a linear regression, a random intercept model, and a random intercept + slope model.
(The simple linear model is considered a baseline and will help exemplify the modeling process.)
NOTE: At this point I’m assuming you’re familiar with Bayesian inference and won’t dive into its main components. If you’re not, stop, and go checkout the suggested readings above.
Bayesian inference requires us to select prior distributions (“priors”) for our parameters before we can fully define our model. Rather than be tempted to use uninformative priors — e.g. normal distributions with extremely large variance— to avoid prior assumptions damping down signals from the data or adding bias, you should aim to select weakly informative priors.
But how can you define what a “weak” prior is?
Good weakly informative priors take into account your data and the question being asked to generate plausible datasets.
A common strategy to define weakly informative priors is to create a “flip-book” of simulated data from your likelihood and prior combination. (This procedure is referred to as a prior predictive check.) Instead of thinking of priors as standalone (marginal) entities, simulating data with them allows us to understand how they interact with the likelihood and whether this pair generates plausible data points.
For example, let’s try choosing weak priors for our random intercepts model parameters (β0, β1 ,b_0 ,τ_0, and σ² ) and simulating some data.
I chose the above priors via an iterative strategy that combined information from the EDA step on population and group trends with group-specific empirical mean and variance estimates (see below). You’ll have to explore some options based on your own data and the questions being considered.
A common strategy to define weakly informative priors is to create a “flip-book” of simulated data (also referred to as a prior predictive check).
Below is an example of what a flip-book might look like (left animation). We see that most of our simulated data points are “reasonable,” meaning we see most fake points fall within our current bounce rate range (≈160–250). Others are outside the range of our data but still within a believable range (e.g. up to 500–600 secs). If instead we had used generic weak priors [e.g. N(0, 100) and Inv-Gamma(1,100)] we’d see a lot of simulated bounce rates <100 sec. or even < 0, which is obviously less probable (right).