
A Visual Primer to Linear Regression

Towards Robust Estimation

…or, What’s In A Mean?

David S. Fulford
The estimator ŷ exhausts the available information of predictors x₁ and x₂

Linear regression, or least squares regression, is the simplest application of machine learning, and arguably the most important. Many people apply the method without realizing it. Whenever you compute an arithmetic mean, you are performing a special case of linear regression: one in which the best predictor of a response variable is the bias (or mean) of the response itself!

Quite a lot has already been written about linear regression on TowardsDataScience, so why author another article on it? My purpose is not to “show how it is done”, but to illustrate linear regression as a convenient and practical example of a more fundamental concept, estimation, and to develop an understanding of its mechanisms for readers.

The word “linear” in “linear model” does not refer to the individual terms of the model, such as whether they are squared, or have a square root, etc. It is surprising to many to find out that predictor variables can have all kinds of non-linear transformations applied to them, and such transformations are often applied in order to create a valid linear model. Rather, “linear” refers to the behavior of the model as a whole: a linear model is one in which a linear combination of the predictor variables yields a prediction of a response variable. This means that

ŷ = β₀ + β₁·f(x₁) + β₂·f(x₂)

is a linear model, where f(x) could be any transformation of a predictor, for example:

f(x) = x²,  f(x) = √x,  f(x) = log(x)

I’ve found that many people are resistant to the idea that we can manipulate our variables however we desire. But why do we accept that a linear combination of predictor variables is a valid method of making a prediction in the first place? The only rules of a least squares regression are that the residuals must have a mean of zero, be uncorrelated, and be homoscedastic (a fancy word to mean the variance of the residuals must be constant). We can do anything, and in fact must often do a lot of things, in order to satisfy these three rules!
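As a quick sketch of this idea (with made-up data in NumPy, not anything from the article): a response that is wildly non-linear in x still fits a valid linear model, because the prediction is a linear combination of the transformed predictors.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: the response depends non-linearly on x...
x = rng.uniform(1.0, 10.0, size=200)
y = 2.0 + 3.0 * np.log(x) + 0.5 * x**2 + rng.normal(0.0, 0.1, size=200)

# ...but the model is still *linear*, because the prediction is a
# linear combination of the (transformed) predictors: 1, log(x), x^2.
X = np.column_stack([np.ones_like(x), np.log(x), x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)  # should land close to the true coefficients [2.0, 3.0, 0.5]
```

The transformations happen before the fit; the least squares machinery never knows or cares that log(x) and x² came from the same raw measurement.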

In order to generalize, we have to introduce some linear algebra. We can alternatively write our model in vector notation as:

ŷ = β₁x₁ + β₂x₂

where the predictor variables are given as column vectors:

x₁ = [x₁₁, x₂₁, …, xₙ₁]ᵀ,  x₂ = [x₁₂, x₂₂, …, xₙ₂]ᵀ
In this form, we can begin to form a better understanding of the geometric representation at the top of this article. It’s quite obvious that we make a prediction as the sum of the vectors β₁x₁ and β₂x₂, but what is ε, the residual? It’s the sum of all information that x₁ and x₂ do not contain, and it is orthogonal to both of them! And here we have our intercept, or bias term, come into play. As we scale to higher dimensions, the concept of an “intercept” doesn’t make as much intuitive sense, as we cannot visualize all variables on a single plot. Instead, consider this as a bias term we are adding in order to “put our thumb on the scale”, so to speak, to make our predictions more accurate than they would otherwise be if we just used the predictor variables. If we redraw the figure including a bias term, we get something like:

The bias term is added to set the average of the residuals to zero

And we see that the bias term, β₀, reduces the magnitude of the residual. (Note: this is not completely accurate, as the residual should be orthogonal to both x₁ and x₂, but alas we are limited to three spatial dimensions.) Of course, we are only looking at a single prediction corresponding to a single value of x₁ and x₂, whereas an actual linear model would make a prediction for each value of x₁ and x₂. The bias term is calculated as the value that sets the average of all residuals to zero.
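A small NumPy sketch (with hypothetical data) makes this concrete: fitting without the bias leaves residuals with a non-zero mean, while adding a column of ones drives that mean to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 5.0 + 2.0 * x + rng.normal(size=100)  # true bias of 5

# Fit with the slope only: the residuals keep a large non-zero mean...
slope, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
res_no_bias = y - slope[0] * x

# ...while adding the column of ones (the bias term) drives the
# mean of the residuals to (numerically) zero.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
res_bias = y - X @ beta

print(res_no_bias.mean())  # roughly 5, the bias we left out
print(res_bias.mean())     # ~0
```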

Let’s go a bit further with our regression equations to illustrate just how we determine the coefficients. First, we can move from writing the model with vectors to matrix notation. We combine 1, x₁, and x₂ into a matrix:

X = [1  x₁  x₂]

And then simplify our regression equation:

y = Xβ + ε

And here, since we’ve discussed what the residuals are, we’ve added the residual term, as we know it must also be included to obtain the actual measured values of the response variable. Remember that the residuals are orthogonal to all predictor variables and to the prediction of the response variable, meaning that they contain unexplained deviation that may exist in other predictor variables we do not have observations of.

We won’t go into the derivation of the solution, but if we solve for the values of β that minimize the sum of squared residuals, we obtain:

β̂ = (XᵀX)⁻¹Xᵀy

And if you need some help with visualizing the matrix multiplications, we have:

XᵀX = ⎡ n        Σxᵢ₁       Σxᵢ₂     ⎤
      ⎢ Σxᵢ₁     Σxᵢ₁²      Σxᵢ₁xᵢ₂  ⎥
      ⎣ Σxᵢ₂     Σxᵢ₁xᵢ₂    Σxᵢ₂²    ⎦

Xᵀy = [ Σyᵢ,  Σxᵢ₁yᵢ,  Σxᵢ₂yᵢ ]ᵀ
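The closed-form solution is easy to check numerically. This sketch (hypothetical data) computes β̂ = (XᵀX)⁻¹Xᵀy directly and also verifies the geometric claim from the figures: the residuals are orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0.0, 0.1, size=n)

# The closed-form least squares solution: beta_hat = (X'X)^-1 X'y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# The residuals are orthogonal to every column of X (dot products ~0).
residuals = y - X @ beta_hat
print(beta_hat)         # close to beta_true
print(X.T @ residuals)  # ~[0, 0, 0]
```

(In practice you would use `np.linalg.lstsq` or a solver rather than forming the inverse explicitly, but the explicit form mirrors the equation above.)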

Now, let’s consider a special case of linear regression in which we only have β₀; that is to say, X contains only the unit column vector, 1. This reduces the prior matrix multiplications to:

XᵀX = 1ᵀ1 = n,  Xᵀy = 1ᵀy = Σᵢ yᵢ

And substituting into the regression equation, we have:

β̂₀ = (XᵀX)⁻¹Xᵀy = (1/n) Σᵢ yᵢ = ȳ

And hopefully, you recognize this! It’s saying that the best predictor of a linear model where we only have a response variable is the average of the response variable itself!
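We can verify the equivalence in a couple of lines (with hypothetical data again):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(10.0, 3.0, size=1000)

# Regress y on nothing but a column of ones...
X = np.ones((y.size, 1))
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# ...and the single fitted coefficient is the arithmetic mean.
print(beta_hat[0], y.mean())  # the two values agree
```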

Why is this significant? Well, this suggests that the required assumptions for least squares regression apply to the calculation of averages. These are:

  • The residuals have a Gaussian distribution with a mean of zero. This is satisfied by the act of regression itself when we compute the bias term, as we learned above.
  • The residuals have constant variance across all predicted values (homoscedasticity).
  • The residuals are i.i.d. (independent and identically distributed). There can be no correlation, or autocorrelation such as one often finds in time series data.

Since we are only dealing with a single response variable, points two and three can be taken as speaking about the variable itself. If we have poor sampling such that there is over-representation in a specific range of values as compared to other ranges, or if the samples are at all correlated (i.e. values are dependent upon one another), then an average is not the best estimator of the variable.

If we invert back from averages to linear models, we can also say that least squares regression is not the best method of regression for problems that do not satisfy our rules.

Least squares regression isn’t always going to work, but there are a lot of other machine learning techniques we can try, right? Well, before we throw neural networks at a wall and see what sticks, let’s consider what we can do about this shortcoming. I’m not a statistician, so before I criticize much of the basis of the entire field of statistics (!), let me instead quote an actual statistician that investigated this problem back in 1964:

Huber is credited as one of the creators of the field of robust statistics, and is the source of the various quotes sprinkled throughout this article. His goal was to create alternative estimators to the least squares estimator that would be better able to handle outliers or other deviations from the rules of least squares estimation. As he points out, one of the primary motivations for choosing least squares among all other alternatives was the convenience of an analytic solution. This was a BIG convenience for any period of history earlier than the late 20th century. Nowadays, however, numerical methods to solve regression problems are, to our human perception of time, often just as quick as analytic solutions. Better estimators exist.
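As a minimal illustration of the robustness Huber was after (with a made-up sample, not his data): the least squares estimator of location, the mean, chases a single gross outlier, while the L1 estimator, the median, stays near the bulk of the data.

```python
import numpy as np

# A made-up sample with one gross outlier
y = np.array([9.8, 10.1, 9.9, 10.2, 10.0, 100.0])

# The least squares estimator of location (the mean) chases the outlier...
print(np.mean(y))    # ~25, dragged far from the bulk of the data

# ...while the L1 estimator (the median) barely notices it.
print(np.median(y))  # ~10
```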

Let’s reduce the idea of an estimator to “a measure of distance”. We typically think of Euclidean distance, √(Σᵢ(aᵢ − bᵢ)²), but this is just one of infinitely many ways of measuring distance. If you were in Manhattan on the city grid, this is an almost useless measurement to determine the distance between you and that corner restaurant you’re trying to reach.

What’s the shortest distance to lunch?

The green line, the Euclidean distance, is the shortest, but also impossible to achieve. And it turns out that the red, blue, and yellow lines, which measure what is colloquially named “taxicab distance”, are all identical in length. If you want to know how you should get somewhere based upon distance alone, a computer cannot calculate the shortest route for you. However, this also means the solution for the shortest route is robust: it’s easy to find an alternative route if something goes (in the words of Huber) catastrophically wrong. In the next article, we’ll talk about these alternatives and learn about their properties, and in doing so, extend regression to handle cases of much greater complexity than least squares is capable of.
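A quick sketch of the two distance measures (the 3-blocks-east, 4-blocks-north example is mine, not from the figure):

```python
import numpy as np

a = np.array([0.0, 0.0])  # you
b = np.array([3.0, 4.0])  # lunch: 3 blocks east, 4 blocks north

euclidean = np.sqrt(np.sum((a - b) ** 2))  # the green line (L2)
taxicab = np.sum(np.abs(a - b))            # any grid route (L1)

print(euclidean)  # 5.0 -- shortest, but unachievable on a street grid
print(taxicab)    # 7.0 -- identical for every monotone grid route
```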

This takes us closer to our goal of accurate estimation.