How to assess the population average effect for longitudinal data
A key assumption underpinning generalized linear models (of which linear regression is one type) is the independence of observations. In longitudinal data this simply does not hold: observations within an individual (across time points) are likely to be more similar than observations between individuals.
So, how do you deal with this? One option is to fit a generalized linear mixed model with random intercept and slope terms for each individual. This tells you, for a specific individual (i.e. conditional on their random intercept and slope), the effect of a variable on the outcome. However, this isn’t very useful if you are interested in the marginal effect, i.e. the effect of a variable on the outcome on average in the population.
If you want to answer these population questions you need to fit a generalized linear model using generalized estimating equations (GEE). This approach estimates the population average effect while accounting for the fact that observations within individuals are likely to be more similar than those between individuals.
Suppose our outcome is all-cause mortality, recorded every month for 10 months for every person, and suppose our exposure is simply time. We can then define a logistic regression model with time (in months) as the sole independent variable and death at that time as the dependent variable. “Okay, great,” I hear you say, “but these observations are obviously not independent!” Spot on, but we’ll come to that.
To use GEE we must first specify a working correlation structure describing how time points within an individual are related. Helpfully, the point estimates are consistent even if we misspecify this structure, and robust Huber-White (“sandwich”) standard errors keep our inference valid under misspecification. So we have some choices.
Independence
This working correlation structure assumes that time points are independent of each other. This is probably an unreasonable assumption in practice.
Exchangeable
This is where the correlation between observations at any two time points is assumed equal. It is commonly used as it requires just one additional parameter α to be estimated.
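To make this concrete, here is a minimal sketch in Python with numpy (the dimension and the value of α are purely illustrative) of what an exchangeable working correlation matrix looks like:

```python
import numpy as np

def exchangeable_corr(n_times, alpha):
    """Exchangeable working correlation: 1 on the diagonal, alpha everywhere else."""
    R = np.full((n_times, n_times), alpha)
    np.fill_diagonal(R, 1.0)
    return R

# for 4 time points with alpha = 0.3, every off-diagonal entry is 0.3
print(exchangeable_corr(4, 0.3))
```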
Autoregressive (AR-1)
This is where the correlation between observations follows an autoregressive structure. Suppose we were using an AR-1 correlation matrix. The correlation between months 1 and 2 of a person would be expected to be α, between months 1 and 3 it would be α², between months 1 and 4 it would be α³, and so on.
This is most appropriate when you think time points closer together are more similar than time points further apart.
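The AR-1 pattern above can be sketched the same way (again with numpy, and an illustrative α): entry (j, k) of the matrix is α raised to the power of the gap between the time points.

```python
import numpy as np

def ar1_corr(n_times, alpha):
    """AR-1 working correlation: entry (j, k) is alpha ** |j - k|."""
    idx = np.arange(n_times)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

# with alpha = 0.5: adjacent months correlate at 0.5,
# months two apart at 0.25, three apart at 0.125, and so on
print(ar1_corr(4, 0.5))
```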
Unstructured
This is where we estimate a separate α for each possible pair of time points. This is the most general case, but for T time points it requires estimating T(T − 1)/2 correlation parameters (45 for our 10 monthly measurements), so you need a lot of data to estimate all of the α used.
Other working correlation structures exist, but these aren’t widely used.
How to choose which one to use?
It’s simple. Either choose the most general structure your data can support (depending on sample size) or choose the one you think suits the data best. Either way, don’t sweat it: the point estimates remain consistent, and robust standard errors keep your inference valid, even if the working correlation structure is misspecified.
Fitting the model is simple. We just fit a GLM using GEE with our specified working correlation matrix:
logit(pij) = β0 + β1Tij
where:
- Yij is 1 if participant i died at time j
- pij is the probability of death for participant i at time j
- β0 is the population average log odds of death at time 0. This can be exponentiated to obtain the odds of death at time 0.
- β1 is the population average difference in log odds of death associated with a one-month increase in time. This can be exponentiated to obtain the odds ratio associated with a one-month increase in time.
- Tij is the time of the j’th measurement for participant i in months.
That’s it. On the population average scale, a one-month increase in time multiplies the odds of death by exp(β1).
Fitting a model in R
We can do this in R using the geepack package. Suppose we already have a dataframe df with three columns: death, time, and person.id. All we have to do is:
mod <- geeglm(death ~ time, id = person.id, data = df, waves = time, family = binomial, corstr = "exchangeable")
You can then just call summary(mod) as normal and get your results!
And of course, you could add covariates to the model by just adding them to the formula. They can be time-varying or constant — either is fine!
Hopefully you’ve come away from reading this with a basic idea of GEE. They should be a tool in the toolbox of any data scientist working with longitudinal data.