I’ll admit I spent much of my first semester of probability theory just struggling to understand the difference between X and x. When I finally learned all the rules for expectations of random variables, I still had zero appreciation for their implications in my future work as an applied statistician.
I recently found myself in a rabbit hole of expectation properties while trying to write a seemingly simple function in
R. Now that I have the output of my function all sorted out, I have a newfound appreciation for how I can use regressions – a framework I’m very comfortable with – to rethink some of the properties I learned in my probability theory courses.
In the function, I was regressing an outcome on a few variables plus a grouping variable, and then returning the group means of the fitted values. My function kept outputting adjusted group means that were identical to the unadjusted group means.
I soon realized that for what I needed to do, my grouping variable should not be in the regression model. However, I was still perplexed as to how the adjusted and unadjusted group means could be the same.
I created a very basic example to test this unexpected result. I regressed a variable from the
iris data set,
Sepal.Length, on another variable called
Sepal.Width and a grouping variable
Species. I then looked at the mean within each category of
Species for both the unadjusted
Sepal.Length and the fitted values from my linear regression model for Sepal.Length:
```r
library(dplyr)

iris %>%
  # fit a linear regression for sepal length given sepal width and species,
  # and make a new column containing the fitted values for sepal length
  mutate(preds = predict(lm(Sepal.Length ~ Sepal.Width + Species, data = .))) %>%
  # compute unadjusted and adjusted group means within each species
  group_by(Species) %>%
  summarise(mean_SL = mean(Sepal.Length), mean_SL_preds = mean(preds))
```
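The same phenomenon can be checked in base R alone (a minimal sketch using `lm()`, `fitted()`, and `tapply()`; no extra packages assumed): when the grouping variable is included in the model, the group means of the fitted values match the raw group means exactly.

```r
# Fit the same model on the built-in iris data
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# Raw (unadjusted) group means of the outcome
raw_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Group means of the fitted values ("adjusted" means)
fitted_means <- tapply(fitted(fit), iris$Species, mean)

# The two sets of group means agree (up to floating-point tolerance)
all.equal(unname(raw_means), unname(fitted_means))  # TRUE
```

This happens because least squares forces the residuals to be orthogonal to every column of the design matrix, including the Species dummies, so the residuals sum to zero within each species.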
I saw the same strange output, even in my simple example. I realized this must be some statistics property I’d learned about and since forgotten, so I decided to write out what I was doing in expectations.
First, I wrote down the unadjusted group means in the form of an expectation. I wrote down a conditional expectation, since we are looking at the mean of Sepal.Length when Species is restricted to a certain category. We can explicitly show this by taking the expectation of a random variable, Sepal Length, while setting another random variable, Species, equal to only one category at a time.
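Written out (my reconstruction of the notation), fixing Species at one category at a time looks like:

```latex
\mathbb{E}[\mathrm{SepalLength} \mid \mathrm{Species} = \mathrm{setosa}]
```

and similarly with versicolor or virginica in place of setosa.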
More generally, we could write out the unadjusted group mean using a group indicator variable for Species, which can take on all possible species values.
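In that more general form (a sketch, with s standing for an arbitrary species category):

```latex
\mathbb{E}[\mathrm{SepalLength} \mid \mathrm{Species} = s],
\qquad s \in \{\mathrm{setosa},\ \mathrm{versicolor},\ \mathrm{virginica}\}
```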
So that’s our unadjusted group means. What about the adjusted group means? We can start by writing out the linear regression model, which is the expected value of SepalLength, conditional on the random variables SepalWidth and Species.
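Concretely, assuming R’s default treatment coding with setosa as the reference level, that conditional expectation has the form:

```latex
\mathbb{E}[\mathrm{SepalLength} \mid \mathrm{SepalWidth}, \mathrm{Species}]
= \beta_0 + \beta_1\,\mathrm{SepalWidth}
+ \beta_2\,\mathbf{1}(\mathrm{Species} = \mathrm{versicolor})
+ \beta_3\,\mathbf{1}(\mathrm{Species} = \mathrm{virginica})
```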
When I used the
predict function on the fit of that linear regression model, I obtained the fitted values from that expectation; I then separated the fitted values by group to get the group means. We can see those fitted values as random variables themselves, and write out another conditional mean using a group indicator variable, just as we did for the unadjusted group means earlier.
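In symbols (again my reconstruction of the notation), the adjusted group mean is a conditional expectation of the fitted conditional expectation:

```latex
\mathbb{E}\bigl[\, \mathbb{E}[\mathrm{SepalLength} \mid \mathrm{SepalWidth}, \mathrm{Species}] \,\bigm|\, \mathrm{Species} = s \,\bigr]
```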
My table of unadjusted and adjusted Sepal Length means thus showed me that:

E[E[SepalLength | SepalWidth, Species] | Species = s] = E[SepalLength | Species = s]
Or, in more general notation:
E[E[Y|X,Z] | Z=z] = E[Y|Z=z]
Is it true?! Spoiler alert – yes. Let’s work through the steps of the proof one by one.