Towards Robust Estimation
…or, What’s In A Mean?
Linear regression, or least squares regression, is the simplest application of machine learning, and arguably the most important. Many people apply the method every day without realizing it. Whenever you compute an arithmetic mean, you are performing a special case of linear regression: one in which the best predictor of a response variable is the bias (or mean) of the response itself!
At the core of the method of least squares lies the idea to minimize the sum of the squared “errors,” that is, to adjust the unknown parameters such that the sum of the squares of the differences between observed and computed values is minimized.
Quite a lot has already been written about linear regression on TowardsDataScience, so why author another article on it? My purpose is not to “show how it is done”, but to illustrate linear regression as a convenient and practical example of a more fundamental concept, estimation, and to develop the reader’s intuition for its mechanisms.
The word “linear” in “linear model” does not refer to the individual terms of the model, such as whether they are squared or under a square root. It surprises many to learn that the predictor variables can have all kinds of non-linear transformations applied to them, and often must, in order to create a valid linear model. Rather, “linear” refers to the behavior of the model as a whole: a linear model is one in which a linear combination of the predictor variables yields a prediction of the response variable. This means that
is a linear model, where x₁ could be (using some example “measurement” value from our data):
I’ve found that many people resist the idea that we can manipulate our variables however we desire. But why do we accept that a linear combination of predictor variables is a valid method of making a prediction in the first place? The only rules of a least squares regression are that the residuals must have a mean of zero, be uncorrelated, and be homoscedastic (a fancy word meaning the variance of the residuals must be constant). We can do anything, and in fact must often do a lot of things, in order to satisfy these three rules!
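To make this concrete, here is a minimal sketch (with made-up data and a hypothetical logarithmic relationship) showing that a model stays linear in its coefficients even when a predictor is transformed non-linearly:

```python
import numpy as np

# Made-up data: the response depends on log(x), not on x directly.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=200)
y = 2.0 + 3.0 * np.log(x) + rng.normal(0.0, 0.1, size=200)

# The model y ≈ b0 + b1 * log(x) is still a *linear* model:
# it is a linear combination of the (transformed) predictor columns.
X = np.column_stack([np.ones_like(x), np.log(x)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # close to [2, 3]
```

The transformation happens before the regression; the fitting machinery never needs to know that a column was ever anything but a plain predictor.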
In order to generalize, we have to introduce some linear algebra. We can alternatively write our model in vector notation as:
where the predictor variables are given as column vectors:
In this form, we can begin to form a better understanding of the geometric representation at the top of this article. It’s quite obvious that we make a prediction as the sum of vectors X₁ and X₂, but what is ê, the residual? It’s the sum of all information that X₁ and X₂ do not contain, and it is orthogonal to both of them! And here our intercept, or bias term, comes into play. As we scale to higher dimensions, the concept of an “intercept” doesn’t make as much intuitive sense, since we cannot visualize all variables on a single plot. Instead, consider it a bias term we add in order to “put our thumb on the scale”, so to speak, to make our predictions more accurate than they would otherwise be if we used the predictor variables alone. If we redraw the figure including a bias term, we get something like:
And we see that the bias term, b, reduces the magnitude of the residual. (Note: this is not completely accurate, as the residual should be orthogonal to X₁, X₂, and b, but alas we are limited to three spatial dimensions.) Of course, we are only looking at a single prediction corresponding to a single value of X₁ and X₂, whereas an actual linear model would make a prediction for each value of X₁ and X₂. The bias term is calculated as the value that sets the average of all residuals to zero.
Let’s go a bit further with our regression equations to illustrate just how we determine the coefficients. First, we can move from writing the model with vectors, to matrix notation. We first combine X₀, X₁ and X₂ into a matrix:
And then simplify our regression equation:
And here, since we’ve discussed what the residuals are, we’ve added the residual term, as we know it must also be included to obtain the actual measured value of the response variable. Remember that the residuals are orthogonal to all predictor variables and to the prediction of the response variable, meaning they contain unexplained deviation that may exist in other predictor variables we do not have observations of.
We won’t go into the derivation of the solution, but if we solve for the values of β that minimize the sum of squared residuals, we obtain:
And if you need some help with visualizing the matrix multiplication, we have:
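As a quick numerical check of the matrix form, this sketch builds a small made-up X (including the unit column X₀) and solves the normal equations directly; notice that the resulting residuals are orthogonal to every predictor column, as discussed above:

```python
import numpy as np

# Made-up design matrix: unit column X0, then predictors X1 and X2.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, 4.0, 2.0],
              [1.0, 3.0, 5.0]])
y = np.array([6.0, 7.0, 9.0, 12.0])

# Textbook form of the solution; in practice prefer np.linalg.lstsq
# for numerical stability.
beta = np.linalg.inv(X.T @ X) @ X.T @ y
residuals = y - X @ beta

# The residuals are orthogonal to each column of X:
print(X.T @ residuals)  # ~ [0, 0, 0]
```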
In the simplest case, … namely the estimation of a location parameter, … this is of course achieved by the sample mean.
Now, let’s consider a special case of linear regression, in which we only have X₀ — that is to say, we only have a unit column vector. This reduces the prior matrix multiplications to:
And substituting into the regression equation, we have:
And hopefully, you recognize this! It’s saying that the best predictor of a linear model where we only have a response variable is the average of the response variable itself!
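You can verify this special case in a couple of lines. With made-up numbers, the normal-equations solution for a lone unit column lands exactly on the arithmetic mean:

```python
import numpy as np

y = np.array([3.0, 5.0, 4.0, 8.0])       # response only
X = np.ones((len(y), 1))                  # just the unit column X0

# (X^T X)^{-1} X^T y reduces to (1/n) * sum(y): the sample mean.
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta[0], y.mean())  # both 5.0
```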
Why is this significant? Well, this suggests that the required assumptions for least squares regression also apply to the calculation of averages. These are:
- The residuals have a Gaussian distribution with a mean of zero. The zero-mean part is satisfied by the act of regression itself when we compute the bias term, as we learned above.
- The residuals have constant variance across all predicted values (homoscedasticity).
- The residuals are i.i.d. (independent and identically distributed). There can be no correlation or autocorrelation, such as one often finds in time-series data.
Since we are only dealing with a single response variable, points two and three can be taken as statements about the variable itself. If we have poor sampling, such that a specific range of values is over-represented compared to other ranges, or if the samples are at all correlated (i.e. values are dependent upon one another), then the average is not the best estimator of the variable.
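A small simulation (with made-up distributions) illustrates the over-representation problem: when one range of values is over-sampled, the average stops estimating the underlying quantity well:

```python
import numpy as np

rng = np.random.default_rng(1)

# True quantity: uniform on [0, 10], so its true mean is 5.
fair = rng.uniform(0.0, 10.0, size=10_000)

# "Poor sampling": the high end of the range is over-represented.
biased = np.concatenate([fair, rng.uniform(8.0, 10.0, size=5_000)])

print(fair.mean())    # ≈ 5.0
print(biased.mean())  # ≈ 6.3 — no longer a good estimate of the quantity
```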
As is now well known, if the true distribution deviates slightly from the assumed normal distribution, the sample mean may have a catastrophically bad performance.
If we generalize back from averages to linear models, we can also say that least squares regression is not the best method of regression for problems that do not satisfy our rules.
Least squares regression isn’t always going to work, but there are a lot of other machine learning techniques we can try, right? Well, before we throw neural networks at a wall and see what sticks, let’s consider what we can do about this shortcoming. I’m not a statistician, so before I criticize much of the basis of the entire field of statistics (!), let me instead quote an actual statistician that investigated this problem back in 1964:
It is interesting to look back at the very origin of the theory of estimation, namely to Gauss and his theory of least squares. Gauss was fully aware that his main reason for assuming an underlying normal distribution and a quadratic function was mathematical, i.e., computational, convenience. In later times, this was often forgotten, partly because of the central limit theorem. However, if one wants to be honest, the central limit theorem can at most explain why many distributions occurring in practice are approximately normal. The stress is on the word “approximately.”
Huber, P. J. 1964. “Robust Estimation of a Location Parameter.” The Annals of Mathematical Statistics 35 (1): 73–101. https://doi.org/10.1214/aoms/1177703732.
Huber is credited as one of the founders of the field of robust estimation, and is the source of the various quotes sprinkled throughout this article. His goal was to create alternatives to the least squares estimator that would better handle outliers and other deviations from the rules of least squares estimation. As he points out, one of the primary motivations for choosing least squares over all other alternatives was the convenience of an analytic solution. This was a BIG convenience in any period of history before the late 20th century. Nowadays, however, numerical methods for solving regression problems are, to our human perception of time, often just as quick as analytic solutions. Better estimators exist.
It is quite natural to ask whether one can obtain more robustness by minimizing another function of the errors than the sum of their squares.
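Huber’s question can be previewed with a toy example (made-up numbers): minimizing the sum of squared errors of a single location parameter recovers the mean, while minimizing the sum of absolute errors recovers the median, which barely moves when a wild outlier appears:

```python
import numpy as np

# Four ordinary observations plus one wild outlier.
y = np.array([2.0, 3.0, 3.0, 4.0, 100.0])

# Brute-force search over candidate location estimates.
grid = np.linspace(0.0, 100.0, 100_001)
l2_loss = ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)  # sum of squares
l1_loss = np.abs(y[:, None] - grid[None, :]).sum(axis=0)   # sum of abs errors

print(grid[l2_loss.argmin()], y.mean())      # ≈ 22.4: the mean, dragged by the outlier
print(grid[l1_loss.argmin()], np.median(y))  # 3.0: the median, barely affected
```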
Let’s reduce the idea of an estimator to “a measure of distance”. We typically think of Euclidean distance, √(a² + b² + …), but this is just one of infinitely many ways of measuring distance. If you were in Manhattan on the city grid, it would be an almost useless measurement for determining the distance between you and that corner restaurant you’re trying to reach.
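For a concrete comparison, here are the two distance measures on a made-up pair of grid coordinates (in city blocks):

```python
import math

# Two points on a Manhattan-style grid (made-up coordinates, in blocks).
a, b = (0, 0), (3, 4)

euclidean = math.hypot(b[0] - a[0], b[1] - a[1])  # straight through buildings
taxicab = abs(b[0] - a[0]) + abs(b[1] - a[1])     # along the streets

print(euclidean)  # 5.0
print(taxicab)    # 7
```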
The green line, the Euclidean distance, is the shortest, but also impossible to travel. And it turns out that the red, blue, and yellow lines, which follow what is colloquially named “taxicab distance”, are all identical in length. If you want to know how you should get somewhere based upon distance alone, a computer cannot calculate a single shortest route for you, because many routes tie for shortest. However, this also means the solution for the shortest route is robust: it’s easy to find an alternative route if something goes (in the words of Huber) catastrophically wrong. In the next article, we’ll talk about these alternative estimators and learn about their properties, and in doing so, extend regression to handle cases of much greater complexity than least squares is capable of.
This takes us closer to our goal of accurate estimation.