
# Assumptions of Common Machine Learning Models

Ordinary least squares (OLS) regression attempts to explain whether there is a relationship between your independent variables (predictors) and your dependent variable (target).

It does this by fitting a line to your data that minimizes the sum of squared residuals.

A residual is the difference between an observed value and the predicted value. Residuals are used as an indication of how well your model fits the data.
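As a minimal sketch of the idea above, the snippet below fits a line with NumPy's least-squares solver and computes the residuals. The data are synthetic and the true intercept (2) and slope (3) are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 2 + 3x + noise
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=200)

# Design matrix: a column of ones for the constant, then the predictor
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the beta that minimizes sum((y - X @ beta) ** 2)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

predicted = X @ beta
residuals = y - predicted          # observed minus predicted
ssr = float(np.sum(residuals ** 2))  # the quantity OLS minimizes

print("intercept, slope:", beta)
print("sum of squared residuals:", ssr)
```

Any other choice of intercept and slope would give a larger sum of squared residuals on this data.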

However, to be able to trust and have confidence in the results, there are some assumptions that you must meet prior to modeling.

Satisfying all these assumptions would allow you to create the best possible estimates for your model.

There are five key assumptions in an OLS regression model.

## Assumption 1: There is a Linear Relationship between the Independent and Dependent Variables.

This assumption caught me off guard when I first heard about it in my statistics class.

I remember feeling so tricked and deceived after I reviewed my exam results that it has etched itself into my memory.

Food for thought.

Which of these equations meets this assumption?

`Y = β₀ + β₁X₁ + β₂X₂`

`Y = β₀ + β₁X₁ + β₂X₂²`

It turns out that both are linear.

There is often a misinterpretation of what is considered a linear equation.

Linear equations = straight lines

Nonlinear equations = curved lines

This is wrong.

When statisticians say that an equation is linear, they are referring to linearity in the parameters and that the equation takes on a certain format.

This is the format:

`Y = Constant + Parameter1 * Variable1 + Parameter2 * Variable2 …`

Note:

1. There must be a constant.
2. Every other term follows the pattern “Parameter * Variable”, and all the terms are added together.

It does not matter if the variables are nonlinear (e.g. squared). As long as the equation follows this format, it is a linear equation. Any equation that fails to follow this format is nonlinear. For example, `Y = β₀ + X₁^β₁` is nonlinear because a parameter appears as an exponent.

This also means that some linear equations produce curved lines when fitted.

So technically… using scatter-plots alone doesn’t really tell you if the fitted curve you see is linear or not. You will probably need to look at the equation of the curve.
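To make the point concrete, here is a small sketch that fits the second equation from above with ordinary linear least squares. The squared variable is simply treated as one more column. The data and the true parameter values (1, 2 and 0.5) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

x1 = rng.uniform(-3, 3, size=300)
x2 = rng.uniform(-3, 3, size=300)

# Hypothetical truth: Y = 1 + 2*X1 + 0.5*X2^2 + noise.
# The variable X2 is squared, but every beta enters linearly.
y = 1.0 + 2.0 * x1 + 0.5 * x2 ** 2 + rng.normal(0, 0.5, size=300)

# Design matrix: constant, X1, and X2 squared as an ordinary column
X = np.column_stack([np.ones_like(x1), x1, x2 ** 2])

# Plain linear least squares recovers the parameters
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated betas:", beta)
```

Plotted against `x2`, the fitted values trace a parabola even though the model is linear in the statistical sense.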

## Assumption 2: No Multicollinearity

Multicollinearity refers to high correlation among your independent variables.

Multicollinearity is a problem because it creates redundant information, which makes the coefficient estimates of your regression model unstable and therefore unreliable.

To detect and deal with this issue, you could use two techniques:

1. Run a correlation analysis across all your independent variables.
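2. Remove independent variables with a high Variance Inflation Factor (VIF)*. As a general rule of thumb, a `VIF > 10` is a strong indication of multicollinearity.

`*VIF = 1 ÷ (1-R²)`, where R² comes from regressing that independent variable on all the other independent variables.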
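The VIF formula can be sketched directly in NumPy. The helper name `vif` and the synthetic columns `a`, `b` and `c` below are hypothetical, chosen only to show one nearly duplicated predictor next to an unrelated one.

```python
import numpy as np

def vif(X):
    """VIF of each column of X (predictors only, without a constant column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on a constant plus all the other predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of a
c = rng.normal(size=500)                 # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print("VIFs for a, b, c:", vifs)
```

In this setup `a` and `b` should land far above the rule-of-thumb threshold of 10, while `c` should stay close to 1.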

## Assumption 3: No Autocorrelation

Autocorrelation means the residuals are not independent of each other, i.e. the residuals of previous observations cause a systematic increase or decrease in the residual of the current observation.

As a consequence, you will tend to underestimate the variance of your estimates, which will affect your confidence intervals and hypothesis tests.

To check for autocorrelation, you can use the Durbin-Watson ‘d’ test. The statistic ranges from 0 to 4, and a value near 2 indicates no autocorrelation. As a rule of thumb, values in the range `1.5 < d < 2.5` satisfy this assumption.
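The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Below is a minimal sketch; the helper name `durbin_watson` and both synthetic residual series are assumptions for illustration.

```python
import numpy as np

def durbin_watson(residuals):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2), for residuals in time order."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

rng = np.random.default_rng(3)

# Independent residuals: d should land near 2
white = rng.normal(size=1000)

# Positively autocorrelated residuals: each value carries over 80% of the last
ar = np.empty(1000)
ar[0] = rng.normal()
for t in range(1, 1000):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()

dw_white = durbin_watson(white)
dw_ar = durbin_watson(ar)
print("independent residuals:   ", dw_white)
print("autocorrelated residuals:", dw_ar)
```

The residuals must be in their original time order for the statistic to mean anything.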

Otherwise, to remedy autocorrelation, you should use “Heteroskedasticity and Autocorrelation Consistent (HAC)” standard errors, which correct the standard errors for the autocorrelation.

Note: You might come across “HAC” as the “Newey–West estimator”.
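For intuition, here is a rough sketch of Newey–West standard errors written out in NumPy. It uses Bartlett weights and no small-sample correction. The helper name `newey_west_se`, the choice of `max_lag=10` and the synthetic data are all assumptions, and in practice you would normally rely on your statistics library's implementation instead.

```python
import numpy as np

def newey_west_se(X, residuals, max_lag):
    """HAC standard errors with Bartlett weights, no small-sample correction."""
    X = np.asarray(X, dtype=float)
    e = np.asarray(residuals, dtype=float)
    Xe = X * e[:, None]                    # row t holds e_t * x_t
    S = Xe.T @ Xe                          # lag-0 term
    for lag in range(1, max_lag + 1):
        w = 1.0 - lag / (max_lag + 1.0)    # weight shrinks as the lag grows
        G = Xe[lag:].T @ Xe[:-lag]         # cross-products `lag` steps apart
        S += w * (G + G.T)
    bread = np.linalg.inv(X.T @ X)
    return np.sqrt(np.diag(bread @ S @ bread))

# Hypothetical data: both the predictor and the errors are autocorrelated
rng = np.random.default_rng(4)
n = 500
x = np.empty(n)
u = np.empty(n)
x[0], u[0] = rng.normal(), rng.normal()
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal()
    u[t] = 0.8 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Classical standard errors, which assume independent residuals
sigma2 = resid @ resid / (n - X.shape[1])
classical_se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

hac_se = newey_west_se(X, resid, max_lag=10)
print("classical SE:", classical_se)
print("HAC SE:      ", hac_se)
```

On data like this the HAC standard error of the slope comes out larger than the classical one, which reflects the underestimation described earlier.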

## Assumption 4: Residuals should be Homoskedastic

Homoskedasticity is the idea that your residual plot should show an even and random pattern across all observations.

In other words, the variance of your residuals should be consistent across all observations and should not follow some form of systematic pattern.

In the image below, the first plot shows a systematic pattern in the residual plot. This is known as heteroskedasticity, and it invalidates the assumption.

The plot below it shows how a homoskedastic residual plot should look.

So what is the problem with heteroskedasticity anyway?

1. Your estimates remain unbiased, but they will no longer be the best (they no longer have the smallest possible variance).
2. It affects the calculation of the standard errors, which in turn affects the results of any hypothesis tests.

To address the first problem of heteroskedasticity, one option is to increase your sample size.

For the second problem, you should apply the “robust standard error” formula to account for the effects of heteroskedasticity on your standard errors.

Note: “Robust Standard Error” is also known as “Heteroskedasticity-Consistent Standard Error” (HC). When programming, you might encounter it as “HC”.
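As an illustration, the simplest variant of the robust formula (often labelled HC0) can be sketched in a few lines of NumPy. The helper name `hc0_se` and the synthetic data, where the noise grows with `x`, are assumptions made for this example.

```python
import numpy as np

def hc0_se(X, residuals):
    """Heteroskedasticity-consistent (HC0) standard errors."""
    X = np.asarray(X, dtype=float)
    e = np.asarray(residuals, dtype=float)
    bread = np.linalg.inv(X.T @ X)
    # Each observation is weighted by its own squared residual
    meat = X.T @ (X * (e ** 2)[:, None])
    return np.sqrt(np.diag(bread @ meat @ bread))

# Hypothetical heteroskedastic data: the noise spread grows with x
rng = np.random.default_rng(5)
n = 1000
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n) * x

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Classical standard errors assume one common residual variance
sigma2 = resid @ resid / (n - X.shape[1])
classical_se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

robust_se = hc0_se(X, resid)
print("classical SE:", classical_se)
print("robust SE:   ", robust_se)
```

The coefficient estimates themselves are unchanged. Only the standard errors, and therefore the hypothesis tests built on them, differ between the two formulas.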

## Assumption 5: Residuals are Normally Distributed

Note that this assumption is about the residuals, not the independent variables themselves. OLS does not require your predictors to be normally distributed.

This assumption is optional in terms of producing the best unbiased estimates.

However, it is needed if you want to perform hypothesis tests or produce confidence intervals or prediction intervals.

Note: You can review the difference between the two here.

There are two ways to check for normality:

1. Create a histogram of the residuals.

2. Run a Q-Q plot on the residuals. All observations should follow a straight line if the residuals are normal.

If you need to meet this assumption but the normality check fails, you could try transforming your variables, for example by taking the logarithm of a skewed variable.
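A Q-Q plot pairs the sorted, standardized residuals with the quantiles you would expect from a normal distribution. The sketch below computes those pairs without plotting and uses their correlation as a rough measure of how straight the line is. The helper names `qq_points` and `qq_correlation` and both sample series are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical normal quantiles, sorted standardized residuals)."""
    e = np.sort(np.asarray(residuals, dtype=float))
    n = e.size
    # Plotting positions strictly between 0 and 1
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    standardized = (e - e.mean()) / e.std(ddof=1)
    return theoretical, standardized

def qq_correlation(residuals):
    """Correlation of the Q-Q points; values near 1 suggest a straight line."""
    theoretical, standardized = qq_points(residuals)
    return float(np.corrcoef(theoretical, standardized)[0, 1])

rng = np.random.default_rng(6)
r_normal = qq_correlation(rng.normal(size=500))       # normal residuals
r_skewed = qq_correlation(rng.exponential(size=500))  # right-skewed residuals

print("normal residuals:", r_normal)
print("skewed residuals:", r_skewed)
```

To see the plot itself, pass the two arrays from `qq_points` to a scatter plot in your plotting library of choice.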