An instrumental variable is a third variable, Z, used in regression analysis when you have endogenous variables — variables that are influenced by other variables in the model. In other words, you use it to account for unexpected behavior between variables. Using an instrumental variable to identify the hidden (unobserved) correlation allows you to see the true correlation between the explanatory variable and response variable, Y. — Statistics How To
Let’s break down some of this into pieces we can understand.
Part 1: The Linear Regression Equation
Let’s say you have two variables that you think are correlated, education and wages (X and Y). You would like to investigate if education leads to higher wages, i.e. X → Y. It makes sense enough. You write y = α + βx + ε, and, content with yourself, spend the rest of the night binging Game of Thrones.
Wait. Slow down. First let’s clarify some things.
- α = the intercept, or “starting point”: the predicted value of Y when X is zero. Not all regressions are going to start at zero; for example, if X is education and Y is wages, you’re probably not going to observe anyone with zero education, because most people nowadays don’t drop out after the 2nd, 3rd, 4th, or even 9th grade (yay, progress!). You’re probably going to be looking at education from high school diploma onward, one year at a time, and the intercept anchors the line so that it fits that range of data.
- β = the weight with which X affects Y. For example, if 1 additional year of education is predicted to bring $100 in additional salary, then the coefficient β is 100. β specifies just how much an additional year of education gets you.
- ε = the error term. Economists love using fancy jargon and Greek almost as much as they love snappy titles, so ε is just a fancy “e” for error. This absorbs anything about Y that X couldn’t explain; very rarely are you going to get data that falls on a perfectly straight line.
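To make α, β, and ε concrete, here is a minimal sketch in base R that simulates education and wages and then recovers the intercept and slope with lm(). Every number in it (the intercept of 20000, the slope of 100, the noise level) is an assumption invented for illustration, not real wage data.

```r
# Simulated data: all values below are made up for illustration.
set.seed(42)
n <- 1000
education <- sample(12:20, n, replace = TRUE)  # years of schooling (X)
epsilon   <- rnorm(n, mean = 0, sd = 500)      # the error term
alpha <- 20000                                 # assumed intercept
beta  <- 100                                   # assumed payoff of one extra year
wages <- alpha + beta * education + epsilon    # Y = alpha + beta * X + epsilon

# Ordinary least squares estimates alpha and beta from the data
fit_ols <- lm(wages ~ education)
coef(fit_ols)  # "(Intercept)" estimates alpha, "education" estimates beta
```

Because the error here is generated independently of education, the estimated slope should land close to the assumed β. That independence is exactly what breaks down in the next step.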
Now we’ve translated y = x to y = α + βx + ε. The problem now is whether it’s truly X that leads to Y. Education leading to higher wages makes sense; but what if people who strive for higher education also earn higher wages because they are a more energetic, ambitious, and driven subset of the population?
This is a big problem. Why? Because it’s not only X that’s leading to Y; something else is driving both X and Y. That “something else” is currently absorbed in the error term because we can’t measure ambition, which means the error term is correlated with X. This violates a basic assumption of linear regression, namely that X is uncorrelated with the error. Economists call this problem endogeneity.
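A small simulation can show what endogeneity does to the estimate. In this sketch, ambition is a made-up variable that raises both education and wages; the regression is then run without it, as we would have to in real life. All coefficients are assumptions chosen for illustration.

```r
# Simulated data: ambition is observable here only because we invented it.
set.seed(1)
n <- 5000
ambition  <- rnorm(n)                         # unobserved in real data
education <- 14 + 1.5 * ambition + rnorm(n)   # ambition raises education
true_beta <- 100                              # assumed true effect of education
wages <- 20000 + true_beta * education + 300 * ambition + rnorm(n, sd = 200)

# Ambition is left out, so it ends up inside the error term
naive_fit  <- lm(wages ~ education)
naive_beta <- coef(naive_fit)[["education"]]
naive_beta  # noticeably larger than true_beta
```

The naive slope credits education with part of the payoff that really belongs to ambition, so it overstates the effect of an extra year of schooling.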
Part 2: Picking an Instrumental Variable
We want to use y = α + βx + ε, but it has quickly become clear that x, education, and y, wages, are both being affected by an unobserved third factor: ambition/drive/that magic quality that creates people like Michael Jordan. Since we can’t measure ambition and deliver it into a tidy CSV, what do we do?
We use something else, something measurable, that correlates with education (X) but has nothing to do with the error term (ε).
These are the requirements of an IV: 1) it can’t correlate with the error (exogeneity), and 2) it does correlate with X, education (relevance).
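Written as a short derivation, and writing z for the instrument, the two requirements are what let us back out β. This is a sketch for the simple case of one instrument and one explanatory variable; the covariance notation is introduced here and is not part of the original equation.

```latex
\begin{align*}
\operatorname{Cov}(z, \varepsilon) &= 0 && \text{(exogeneity)} \\
\operatorname{Cov}(z, x) &\neq 0 && \text{(relevance)} \\
\operatorname{Cov}(z, y) &= \beta \operatorname{Cov}(z, x) + \operatorname{Cov}(z, \varepsilon)
  = \beta \operatorname{Cov}(z, x) \\
\beta &= \frac{\operatorname{Cov}(z, y)}{\operatorname{Cov}(z, x)}
\end{align*}
```

The third line takes the covariance of z with both sides of y = α + βx + ε. Exogeneity removes the error term, and relevance guarantees we are not dividing by zero in the last line.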
In this case, early smoking behavior makes a plausible instrument. Why? Because early smoking behavior and years of education attained are correlated. Early smoking behavior and ambition, on the other hand, arguably aren’t; lots of successful people had rough childhoods where they smoked. In fact, Dr. Matt Dickson already published a paper on this effect.
Part 3: Using an Instrumental Variable via 2SLS
Now you have the data on X (education), Y (wages), and Z (early smoking behavior). You’ve declared that Z will make a great instrumental variable to deal with the endogeneity inherent in X. One question remains: How do we include an instrumental variable in a regression equation?
We’re going to do this by creating two equations, which is called a Two Stage Least Squares (2SLS) estimate. All it accomplishes is a slight redefinition of the education variable to be a function of early smoking behavior:
- education = c + d*(early smoking behavior) + v
- wages = α + β*education + ε
In this example, c is the starting point (like α), d is the weight (like β), and v is the error (like ε). We first estimate education as a function of early smoking behavior, then we plug the predicted values of education from that first equation into the original wage equation in place of the observed ones. That’s why it’s called Two Stage Least Squares; we’re estimating two equations to correctly answer an initial one.
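The two stages can be sketched by hand with two lm() calls in base R. The data are simulated, and the variable names, the 0/1 coding of early smoking, and every coefficient are assumptions made up for this illustration.

```r
# Simulated data: every coefficient below is an assumption for illustration.
set.seed(7)
n <- 5000
ambition      <- rnorm(n)            # unobserved confounder
early_smoking <- rbinom(n, 1, 0.3)   # instrument: 1 = smoked early
education <- 15 - 2 * early_smoking + 1.5 * ambition + rnorm(n)
true_beta <- 100
wages <- 20000 + true_beta * education + 300 * ambition + rnorm(n, sd = 200)

# Stage 1: education = c + d * (early smoking behavior) + v
stage_1       <- lm(education ~ early_smoking)
education_hat <- fitted(stage_1)     # predicted education

# Stage 2: wages = alpha + beta * (predicted education) + error
stage_2   <- lm(wages ~ education_hat)
beta_2sls <- coef(stage_2)[["education_hat"]]

# Plain OLS for comparison; it ignores the endogeneity
beta_ols <- coef(lm(wages ~ education))[["education"]]
c(ols = beta_ols, two_stage = beta_2sls)
```

In this simulation the two-stage estimate should sit near the assumed true_beta while plain OLS overshoots. One caveat: the standard errors that summary(stage_2) reports are not valid for 2SLS, which is a good reason to use a dedicated function rather than the manual version for real work.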
Part 4: Analysis in R
You now have all the theory you need to understand what an instrumental variable is and how we use it IRL with 2SLS. Running the analysis in R is as easy as pie:
install.packages("AER") # package that provides the ivreg() function
library(AER)
# source_data is a placeholder for your own data frame
reg_1 = ivreg(wages ~ education | early_smoking_age, data = source_data)
In human words: this regresses wages on education, with early smoking age used as the instrument. The data source being used is called source_data. I’m naming this regression reg_1 so I can call on it later.
For the full list of options, see the ivreg() documentation in the AER package (run ?ivreg in R once the package is loaded).
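If you don’t have a dataset handy, the following sketch builds a simulated one and runs the same kind of ivreg() call on it. The data frame sim_data, its coefficients, and the assumption that a later smoking age goes with more education are all invented for illustration. The ivreg() step is skipped if AER is not installed.

```r
# Simulated data: all relationships here are assumptions for illustration.
set.seed(11)
n <- 5000
ambition          <- rnorm(n)                          # unobserved confounder
early_smoking_age <- sample(12:21, n, replace = TRUE)  # hypothetical instrument
education <- 10 + 0.3 * early_smoking_age + 1.5 * ambition + rnorm(n)
wages     <- 20000 + 100 * education + 300 * ambition + rnorm(n, sd = 200)
sim_data  <- data.frame(wages, education, early_smoking_age)  # ambition left out

# With a single instrument, the IV estimate equals cov(z, y) / cov(z, x)
beta_iv <- with(sim_data,
                cov(early_smoking_age, wages) / cov(early_smoking_age, education))
beta_iv

# Same model through ivreg(); only runs if the AER package is available
if (requireNamespace("AER", quietly = TRUE)) {
  reg_sim <- AER::ivreg(wages ~ education | early_smoking_age, data = sim_data)
  print(summary(reg_sim, diagnostics = TRUE))  # adds instrument diagnostics
}
```

Passing diagnostics = TRUE to summary() adds a weak-instrument test and a Wu-Hausman test to the output, which is a quick way to check whether the instrument is relevant enough to trust.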
Part 5: Conclusions
It took a long time for me to understand the purpose of an instrumental variable, or how to pick a great one; I even earned 2nd place at the 2019 UChicago Econometrics Games before fully understanding it. For this reason, if you don’t completely understand everything about an IV that’s O.K. and totally normal.
I hope you feel a little more equipped to work with instrumental variables now; if you have any thoughts please comment (or leave a clap) below. Thanks for reading! ❤️