Contents

#### The bread and butter of applied statistics

Linear models are really, really important. While other linear models exist (hierarchical, proportional hazards, etc.), GLMs provide a great starting point.

First, the business aspect. For making recommendations and communicating results, interpretability is key. I can’t think of a more interpretable machine learning (ML) model than the GLM. Frequentists can test hypotheses using GLM to perform the likelihood ratio test, while Bayesians can get posteriors by fitting a GLM using priors.

Second, the technical aspect. Even when interpretability is not a concern, regularized GLM is perhaps the most commonly used ML model. It is often the “minimum viable product” for a data science project because it is simple to train and deploy while performing reasonably well. Booking.com detailed how they can productionize models quickly and at scale in this article (spoiler: it starts with GLMs). If you have many possible projects to do, it can be more worthwhile to stick with GLM and return to improve the models only after you’ve gone through all the high-priority projects.

A few years ago there was a viral tweet:

When you’re fundraising, it’s AI.

When you’re hiring, it’s ML.

When you’re implementing, it’s linear regression.

When you’re debugging, it’s printf().

And people sarcastically joke:

While these claims are not entirely factual, there is a ring of truth in it. Sure, many jobs require the opposite end of the spectrum: deep learning. Good luck trying to do NLP or computer vision using linear regression. But isn’t deep learning just a bunch of logistic regression smushed together? (I jest.)

Hopefully you are convinced that GLM is an important skill to master in analytics and data science. I want to distill the essence of GLMs into this short (ok, lengthy) article so that you can have a good understanding by investing 20 minutes of your time rather than reading an entire textbook.

This article will first talk about logistic regression because it is the most common and familiar type. It will then be used as an example to explain the overarching GLM concepts. Then we will dive into other types of GLMs. Knowledge of ordinary least squares (OLS) regression is assumed.

### Binary Outcomes: Logistic Regression

Logistic regression is used for binary responses encoded as 1 or 0, true or false. When aggregated, we might have proportions or counts of successes as the response variable. We assume the data is generated like coin flips (binomial distribution) and we want to predict the probability of success conditional on the predictors.

We will use the Space Shuttle Challenger O-rings dataset for demonstration. In 1986, the space shuttle disintegrated shortly after launch, killing all its passengers, because of O-ring malfunction.

The incident was preventable. Engineers had voiced concerns and strongly recommended that the launch be delayed because of weather conditions, but the warnings were ignored by NASA management. The night before the launch, engineer Ebeling told his wife “It’s going to blow up.”

How did the engineers know? Here’s the data they had (points slightly jittered):

At launch, the external temperature was 31F. Even by visual inspection, we should be alarmed. The engineers asked to wait until temperature rose above 53F. NASA believed a 0.001% chance of launch failure. How risky was the launch, really?

The most naive way to answer the question is to use OLS regression to predict probabilities:

However, this model doesn’t make sense. For starters, it can predict probabilities outside of [0,1], or in this case, expected counts outside of [0,6]. Instead of fitting a straight line as with OLS, logistic regression fits a logistic function, a form of sigmoid (S-shaped) function:

We can set a lower and upper limit to our predictions. As it turns out, predicting the log odds (logit) and feeding it into the logistic (inverse logit) function achieves our goal. Our regression model is:

Note the absence of ε. The irreducible error comes from the binomial distribution, which is a function of p.

In R, we can fit the model with a single function:

model <- glm(

damage/6 ~ temp,

data = faraway::orings,

family = 'binomial',

weights = rep(6, nrow(faraway::orings))

)

There are two ways of inputting grouped binomial data into R’s glm. I prefer this one, where the response is the proportion and the weights are number of trials. The predicted means, with a 95% confidence interval:

Based on the model assumptions, we expect that on average 5 or 6 of the O-rings would’ve failed in the launch conditions. The launch was doomed.

Now that we have seen our first example, we will discuss general aspects of GLM before moving on to other distributions.

#### Theory: Link Functions

GLMs predict link functions instead of the raw observations. Think of them as a way of linking your observations to your model.

The best way to explain a link function is to contrast it against a transformation function. The log-transform is often used to deal with right-skewed residuals in OLS regression to predict E[ln(y)|X], the conditional expectation of a function. Using Gaussian GLM with the log link instead predicts ln(E[y|X]), **a function of the conditional expectation**.

The difference becomes clearer when we see the regression equations with ε following the normal distribution by assumption:

- log transform: ln(y) = Xβ + ε. Conditional distribution of y is
*lognormal*. - log link: ln(y + ε) = Xβ. Conditional distribution of y is
*normal*.

In log transform, exp(fitted) isn’t the predicted mean but the mode. Neither is better than the other: they make different assumptions.

Link functions display their awesomeness in count and binary observations. We often use the log link function to model counts. How often have you been frustrated at being unable to take log(0)? Under the Poisson assumption, you can observe a 0 even when the conditional mean is nonzero. Taking the log of the conditional mean poses no problem.

The canonical link function in logistic regression is the logit (log odds) function

We say canonical because it is the one we get from the binomial distribution itself. The logit function is a convenient way to transform (0,1) → R, but there are many functions to choose from. Why this one?

See anything familiar? We call logit(p) the natural parameter of the binomial distribution because it’s the parameter that interacts with y. Exercise: do the same with Poisson to show why ln(λ) is the canonical link.

Of course, nobody is forcing you to use the canonical link. For instance, the probit link is often used to model rare events.

#### Theory: How GLMs Work

There are several algorithms to fit a GLM but this article will talk only about iteratively reweighted least squares (IRLS) because it gives the best intuition. Viewing it through the lens of gradients is too abstract. Some of the math will be waved away so we can focus on the intuition. We start with the closed form solution of OLS:

In OLS, all of the observations are assumed to be independent and the errors are drawn from the same normal distribution with a common variance. If the observations were not equally weighted, the solution would look like:

In OLS, W is the identity matrix and is therefore omitted. But what if it is not the identity, as in weighted least squares? We keep the independence assumption, so W is a diagonal matrix. Each entry on the diagonals should be the precision, or 1/variance.

This last point is a common and important theme in statistics. So many things in statistics can be viewed as a **precision-weighted average**. When we are really sure about something (variance is small), we want to give it a high weight when computing our estimate — precision tells us how much information is contained in each observation.

As a more concrete example, we know that the standard error of the sample mean s can be computed as

What if we have grouped data, and we assume each individual observation comes from a group-specific normal distribution that shares the same variance? If we have three groups, we can compute the global mean as

In this case, the precision-weighted mean is the exact same thing as a regular weighted average. Keep this intuition in mind. The W in GLM for non-normal distributions gets more complicated but the entries are still precisions.

The next step is to replace y with the working dependent variable z. We are not trying to predict y directly, so we need to modify it to work with the link function.

The e here is the actual prediction error. In our logistic regression example, if Xb = 1, then our predicted probability is inv.logit(1) = 0.73. If we observe a success, then e = 1 – 0.73 = 0.27. The σ is from our distribution of choice. Since logistic regression uses binomial, σ² = p(1 – p).

What if we assume the normal distribution with identity link g(x) = x? Then the final term is

which is just the **standardization formula**! So our working dependent variable is our predicted value plus how many standard deviations away is our actual value.

We are armed with 1) weights to take our weighted average and 2) a “standardized” variable on which we want to take the average.

The final step is to realize that **W and σ depend on the predicted value and vice versa**. The two equations above look recursive, huh? In fact, the sample mean and sample variance are independent if and only if the samples come from a normal distribution. This is a profound statement — no distribution other than the normal has this property, which is why we can solve OLS in a single step.

We can think of IRLS as an expectation-maximization algorithm. We hold the predicted values fixed and compute the W and z. Then, hold W and z fixed to compute the new predicted values. Iterate until convergence.

And there we have it. Intuitively, GLM is an algorithm to take a precision-weighted average in a smart way.

#### Diagnostics: Deviance

The deviance, -2 log likelihood, should follow a χ² distribution if the model is a good fit. As a rule of thumb, if deviance is too far above the model degrees of freedom, then the model might not fit the data well. Be careful when working with grouped data as R outputs the incorrect deviance and you should calculate it manually. Simple demonstration:

dat <- data.frame(

y = 5:7 / 10,

n = rep(10, 3),

x = 1:3

)

model <- glm(

y~x,

data = dat,

weights = n,

family = 'binomial'

)

summary(model)

R mistakenly thinks that 3 rows = 3 observations even though we have 30 observations. The residual deviance should have 28 degrees of freedom.

binom_deviance <- -2 * sum(

log(choose(dat$n, dat$y * dat$n)) +

log(fitted(model)) * dat$y * dat$n +

log(1-fitted(model)) * (1-dat$y) * dat$n

)

pchisq(

q = binom_deviance,

df = sum(dat$n) - length(model$coefficients),

lower.tail = FALSE

)

Because the p-value is huge, 0.9999, we fail to reject the null hypothesis that the model is a good fit.

#### Diagnostics: Pearson Residuals

To visually inspect a GLM, I suggest looking at the Pearson residuals, which is the residual divided by the theoretical standard deviation, like sqrt(p × (1-p)) for the binomial and sqrt(λ)** **for Poisson. Because the variance depends on the fitted value, doing this gets all the residuals on the same unit scale and you can eyeball using the empirical rule of a standard normal. Only a few Pearson residuals should fall outside of (-2, 2) and very rarely should they fall outside of (-3,3).

set.seed(123)

x_val <- rnorm(100)

y_poisson <- sapply(

x_val,

function(x) rpois(n = 1, lambda = exp(x))

)

y_negbin <- sapply(

x_val,

function(x) rnbinom(n = 1, size = 0.8, mu = exp(x))

)

model_correct <- glm(y_poisson~x_val, family = 'poisson')

model_wrong <- glm(y_negbin~x_val, family = 'poisson')

plot(

fitted(model_correct),

residuals(model_correct, type = 'pearson'),

main = 'Good fit',

ylab = 'Pearson residuals',

xlab = 'Fitted values'

)

In general we do not expect Pearson residuals to follow the normal distribution because the conditional distribution of error terms is typically skewed — counts cannot go below 0, and probabilities are bounded between 0 and 1. Thus, a normal QQ plot is not helpful at face value.

However, if you take the absolute value of Pearson residuals, they should resemble a half-normal distribution if the model is a good fit. A quick and hacky way to do it is to make a second copy of the Pearson residuals and flip the sign and do a normal QQ plot.

double_resid <- c(

residuals(model_correct, type = 'pearson'),

-residuals(model_correct, type = 'pearson')

)

qqnorm(double_resid, main = 'Good fit')

qqline(double_resid)

#### Hypothesis Testing: Likelihood Ratio Test

Testing with GLMs can be done through the LRT. Check my previous article for a detailed explanation. The basic idea is we want to compare two models, one nested within the other. The smaller model’s predictors are a subset of the larger model’s. The family, data, link function, etc. have to all be the same. We compare the difference in deviance to the χ² distribution with df = number of additional parameters estimated:

dat <- data.frame(

x1 = rnorm(10),

x2 = rnorm(10)

)

dat$y <- dat$x1 + dat$x2 + rnorm(10)

model_small <- glm(y~x1, data = dat)

model_large <- glm(y~x1+x2, data = dat)

pchisq(

q = model_small$deviance - model_large$deviance,

df = model_small$df.residual - model_large$df.residual,

lower.tail = FALSE

) # Likelihood Ratio Test

summary(model_large)$coefficients # Wald test

In this instance, LRT computes the p-value of H0: coef(x2) = 0. It is preferable to the p-value contained in the regression summary, which is a *convenient approximation* through the Wald test (and, depending on the regression formula, might not even answer the right question). Bayesians can draw the parallel to when quadratic approximation fails.

#### In Practice: Regularization and Cross-Validation

When you care about p-values and unbiased estimates (under model assumptions), you use vanilla GLM. But if you care about making **better predictions**, you use regularized GLM. And when you have more predictors than observations (p > n), you are forced to use regularization anyway because the MLE does not exist.

Talking about regularization is hard because different packages have different formulations. Although the end results are equivalent, the parameters are defined differently (C or λ). I will stick with R’s glmnet documentation.

GLM minimizes deviance. Regularized GLM minimizes deviance (multiplied by a constant that I’ll wave away) + a penalty term:

- lasso adds the L1 norm as penalty. You sum the absolute value of the coefficients (except the intercept) and multiply it by a constant λ.
- ridge adds the L2 norm as penalty. You sum the squares of the coefficients (except the intercept) and multiply it by a constant λ.
- elasticnet takes a sort of weighted average of the two penalties. You pick a constant α and set the penalty term as α × L1 + (1-α)/2 × L2.

The L1 norm encourages sparsity and can set many coefficients to 0, while ridge shrinks coefficients towards 0 but never gets there. Think of the λ term as setting a “budget” so each coefficient cannot spend too lavishly (being huge in magnitude).

You need to standardize the predictors (glmnet does this by default) to fairly allocate the budget. If one predictor is in g while another is in kg, it might make more sense to give a bigger budget to kg. (How many cakes can we produce? Maybe +1 for each kg of flour but +0.001 for each g of sugar.) Standardization puts all predictors on the same scale.

One thing to note that many people might miss: with regularization, you should encode all levels of the predictors. If you have a categorical predictor with k possible values, traditional dummy encoding will drop one as a reference level and then create k-1 indicator columns. With regularization, **you should create k indicator columns**. The intercept is not penalized. Because the dropped level gets absorbed into the intercept, the coefficients will change depending on which level gets dropped — undesirable.

We still need to set the hyperparameters λ and α. The simplest approach is to use grid search for α (because it’s between 0 and 1, it’s simple to do evenly spaced intervals) and k-fold cross-validation for λ. Specific details about CV can be found elsewhere.

One major difference I notice between statistics and computer science is in how they choose their λ (or C) from CV. Implementations in Python return the λ that yields the minimum objective function (lambda.min) and you don’t have a choice. In R, you have a choice between lambda.min and the largest λ that is within one standard error of the minimum (lambda.1se).

By the law of parsimony (Occam’s razor), some prefer lambda.1se as it results in a simpler model that performs about as well as lambda.min. Also, lambda.1se tends to be more stable. Re-randomizing the data into the k folds can yield wildly different lambda.min but more similar lambda.1se.

All talk can get confusing, so let’s fit a lasso to see it in action. R code:

library(glmnet)

# The 0+ makes Species not drop a reference level

X <- model.matrix(Sepal.Length~0+., data = iris)

y <- iris$Sepal.Length

set.seed(123)

train <- sample(1:nrow(X), floor(nrow(X) * 2/3))

model_cv <- cv.glmnet(X[train,], y[train])

plot(model_cv)

model_1se <- glmnet(

X[train,],

y[train],

lambda = model_cv$lambda.1se

)

model_min <- glmnet(

X[train,],

y[train],

lambda = model_cv$lambda.min

)

mse <- function(pred, actual){

mean((actual - pred)^2)

}

# while sometimes model_min yields lower MSE

# generally model_1se performs better in this example

mse(predict(model_1se, X[-train,]), y[-train]) # 0.09685109

mse(predict(model_min, X[-train,]), y[-train]) # 0.1022131

Yeah, you can groan because we used the Iris dataset. We want to predict Sepal.Length using all the other variables. The plot() function returns

Here log(lambda.min) is around -7 while log(lambda.1se) is around -5, as indicated by the dotted lines. Feel free to try different seed values to compare the MSE, but typically lambda.1se performs better. (I don’t know why this isn’t the default instead of lambda.min in Python. Manually writing the cross-validation to get lambda.1se is a pain.)

I split the data into training and testing sets to illustrate performance on held-out data, but in practice you should *not *do this split. The errors from CV are more trustworthy than from the test set. Do CV on the entire data to pick λ and then refit on the whole dataset.

I want you to pay attention to the estimated coefficients:

The coefficient for versicolor is the one that is set to 0 because it’s the middle value. Lasso has learned to absorb versicolor, instead of the other species, into the intercept.

Intuitively, if {virginica, versicolor, setosa} is {1, 2, 3} and we drop versicolor, then the MLE coefficients will be {-1, /, 1} and they get shrunk equally. But if we drop setosa because we go by alphabetical order, then the MLE coefficients will be {-2, -1, /} and the -2 will be shrunk more than the -1. This doesn’t make sense as nothing about the data or the real world has changed. We don’t want predictions to depend on which level we manually drop.

Now that we have a solid understanding of GLMs, we’ll continue with other types. These will be covered only briefly because they build on top of what we have previously discussed.

### Count Data: Poisson and Negative Binomial Regression

Poisson uses the log link so the regression equation looks like

As previously mentioned, the magic of the log link allows us to take ln() even though counts can be 0.

Sometimes, we are interested in the rate instead of the counts because of different exposure units. If we want to model how many people have gotten sick in each city, a city with 100,000 inhabitants should have more than a small city with 1,000 inhabitants due to population size alone.

In logistic regression, we used the weights argument, but in Poisson regression the weights should virtually never be touched. Instead, we use an offset. Where y is the count of incidents, we rearrange the regression equation:

An offset simply means we don’t estimate the slope but fix it at 1, such as ln(n) in the RHS. In R:

glm(y~x+offset(log(n)), family = 'poisson')

In practice, Poisson regression is not enough. A lot of data exhibits overdispersion — the variance of observations is higher than suggested by the Poisson distribution. This can be discovered by examining the Pearson residuals; a lot of them will lie outside of (-2, 2).

I suggest trying out Poisson regression first. If there is evidence of overdispersion, try negative binomial regression, which also uses log link but has an overdispersion parameter for the variance. In R, this is the glm.nb() function.

The negative binomial is really neat. If you sum up Poisson distributions, where each one has a λ that varies according to the Gamma distribution, then you get the negative binomial. In other words, we explicitly model that even though two units might have the same X, their λ** **might be different. It’s a good way to explain the overdispersion, as you have the variance from both the Poisson and Gamma.

Sometimes you have way more zeros than would be suggested by Poisson or negative binomial. If you have an idea about the data generation process, you can use zero-inflated regression or a hurdle model (though these are computationally expensive). Refer to this thread for intuition.

### Contingency Tables: Log-Linear Models

If we have a bunch of Poisson random variables and constrain their total count to be fixed, then we get the multinomial distribution. So even though contingency tables are intuitively modeled as multinomial, we can use Poisson regression to do the job. And you should be glad, because the multinomial is a pain.

Log-linear model is just a fancy term for Poisson regression where all the predictors are categorical — in this case, whatever categories are in your contingency tables. I’ve written about this before, so please refer to the section about G-test if you’re interested in more.

### Categorical Outcome: Multinomial Regression

Multinomial regression is logistic regression that is extended to predict more than binary outcomes. In fact, it is just a couple of logistic regressions bunched together. In logistic regression we predict the log odds, where the odds is p/(1-p). In multinomial regression, we do the same thing except pick a reference category to compute the odds.

Suppose we have three categories A, B, and C. If we use A as the reference category, then we fit two logistic regression models:

The first model is trained only on data where the response is either A or B, and the second model is trained only on data where the response is either A or C. We pass the predictions into the softmax function to get

The 1 looks so out of place, but remember that log(pA/pA) = log(1) = 0.

The softmax function is just the logistic function that’s generalized to more than two categories. It is often used to conveniently turn predictions into probabilities, but here it is more than convenient: it makes sense. Suppose we have no predictors other than the intercept, and we observe 70 A, 20 B, and 10 C. Then:

With MLE, it does not matter which category you choose as the reference category. With regularization, it does matter if you do it this way, so regularized multinomial regression is not typically implemented as a bunch of logistic regressions.

All this talk about the sigmoid and softmax functions sounds like neural networks. Indeed, the most primitive neural networks were a bunch of logistic regressions bundled with softmax functions, though the field has progressed a lot since then.

### Ordinal Outcome: Ordinal Regression

Personally I find ordinal regression quite unintuitive. You want to encode the y as 1, 2, 3, …, k and want to predict the probability of y = i where i is an integer between 1 and k. We use the cumulative link function:

Our regression equation looks like

where i only goes up to k-1. Remember P(y > k) is 0. In the regression equation, β** **does not include an intercept. And we make predictions using a basic fact

Ok, so what do all these equations mean? We are fitting a sigmoid function to predict the CDF conditional on X. In the regression equation, β is fixed for all i, but θ changes with i. When X is fixed, we fit a single function while θ determines the cutoff points.

Here’s the problem and intuition. Probability must sum to 1. Imagine a stick of length 1, and we have an ordinal variable y in {1, 2, 3, 4, 5}. We want to break the stick into 5 pieces where the length of each piece corresponds to the probability of falling in a certain category. The simplest solution is to pick 4 points to make the cut: the k-1 thetas.

### Further Resources

I’m a huge fan of this course website, which has served as a good companion on my self-learning journey. And as always, please let me know if you have suggestions for improvements.

A Primer on Generalized Linear Models was originally published in on Medium, where people are continuing the conversation by highlighting and responding to this story.