Step 5- We will check whether there are any null values in the training data and handle them before we start modelling.
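As a quick sketch (where Data is the training DataFrame used throughout this tutorial), we can inspect the non-null counts and the number of missing values per feature:

Data.info()                  # record count and non-null count per feature
print(Data.isnull().sum())   # number of missing values per feature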
Here we can see that there are 200,000 records in total in the training data and that each of the three features has a few values missing. For example, feature 1 has 60 missing values (200,000 − 199,940).
Step 6- We use SimpleImputer to fill the missing values of a feature with the mean of the other records for that feature. In the code below, we pass strategy="mean" for this purpose. Scikit-learn provides several strategies, viz. "mean", "median", "most_frequent" and "constant", to replace missing values. I suggest you explore the effect of each strategy on the trained model as a learning exercise.
In the code below, we create an instance of SimpleImputer with strategy "mean" and then fit it on the training data to calculate the mean of each feature. The transform method is then used to fill the missing values with those means.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
imputer.fit(Data)                         # learn the mean of each feature
TransformData = imputer.transform(Data)   # fill the missing values with those means
Step 7- It is good practice to check the features once more after replacing the missing values, to ensure that no null (blank) values remain in our training dataset.
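One way to verify this (a sketch: TransformData is the NumPy array returned by the imputer, so we wrap it back in a DataFrame with the original column names; TransformDataFrame is a name chosen here for illustration):

TransformDataFrame = pd.DataFrame(TransformData, columns=Data.columns)
TransformDataFrame.info()   # every feature should now report 200,000 non-null values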
We can see that all the features now have non-null, i.e. non-blank, values for all 200,000 records.
Step 8- Before we start training the algorithms, let us check the variance inflation factor (VIF) among the independent variables. VIF quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity. For a feature i, VIF_i = 1 / (1 − R_i²), where R_i² is the coefficient of determination obtained by regressing feature i on all the other independent variables. I encourage you to read the Wikipedia page on the variance inflation factor to gain a good understanding of it.
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif["features"] = X.columns
vif["vif_Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
In the above code, we calculate the VIF of each independent variable and print it. In general, we should aim for a VIF of less than 10 for each independent variable. We saw earlier in the heatmap that none of the variables is highly correlated, and the same is reflected in the VIF index of the features.
Step 9- We will extract the target, i.e. the dependent variable values, from the RawData dataframe and save it in a data series.
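For example, if the dependent variable column in RawData is named "target" (the column name here is an assumption; substitute the actual one):

y = RawData["target"]   # "target" is a placeholder for the actual column name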
Step 10- We will evaluate the performance of various regressors, viz. HuberRegressor, LinearRegression, Ridge and others, on a dataset containing outliers. In the code below, we create instances of the various regressors.
from sklearn.linear_model import (LinearRegression, SGDRegressor, HuberRegressor,
                                  Ridge, ElasticNet, PassiveAggressiveRegressor)
from sklearn.svm import LinearSVR

Huber = HuberRegressor()
Linear = LinearRegression()
SGD = SGDRegressor()
SVR = LinearSVR()
RidgeReg = Ridge()        # named RidgeReg so the Ridge class is not shadowed
Elastic = ElasticNet()
PassiveAggressive = PassiveAggressiveRegressor()
Step 11- We declare a list with the instances of the regressors so that we can pass them in sequence in a for loop later.
estimators = [Linear, SGD, SVR, Huber, RidgeReg, Elastic, PassiveAggressive]
Step 12- Finally, we will train the models in sequence on the training dataset and print the coefficients of the features calculated by each model.
import numpy as np

for i in estimators:
    i.fit(X, y)   # train each model on the features and target prepared earlier
    print(type(i).__name__ + " Coefficients:", np.round(i.coef_, 2))
We can observe a wide range of coefficients calculated by the different models, depending on their optimisation and regularisation factors. The calculated coefficient for feature 1 varies from 29.31 to 76.88.
Due to a few outliers in the training dataset, some models, like linear and ridge regression, predicted coefficients nowhere near the true coefficients. The Huber regressor is quite robust to outliers: it ensures that the loss function is not heavily influenced by them, while not completely ignoring their effect the way TheilSenRegressor and RANSACRegressor do. Linear SVR also offers more options in the selection of penalties and loss functions, and it performed better than the other models.
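To see how the estimators that largely ignore outliers behave on the same data, here is a minimal sketch (assuming the same X and y as above):

from sklearn.linear_model import TheilSenRegressor, RANSACRegressor

TheilSen = TheilSenRegressor()
RANSAC = RANSACRegressor()

TheilSen.fit(X, y)
RANSAC.fit(X, y)

print("TheilSenRegressor Coefficients:", np.round(TheilSen.coef_, 2))
# RANSAC wraps a base estimator; the fitted model is exposed as estimator_
print("RANSACRegressor Coefficients:", np.round(RANSAC.estimator_.coef_, 2))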