Support vector regression (SVR) is a special kind of regression that gives you a buffer, or tolerance, for the error. How does it do that? I’m going to explain it to you in simple terms by showing two different graphs.
The above is a hypothetical linear regression graph. You can see that the regression line is drawn at the position that minimizes the sum of squared errors. Each error is the square of the vertical distance between the original data point (points in black) and the regression line (predicted values).
The above is the same setting with SVR (Support Vector Regression). You can observe that there are 2 boundaries around the regression line. This is a tube with a vertical distance of epsilon above and below the regression line. It is known as the epsilon-insensitive tube. The role of this tube is to create a buffer for the error. To be specific, all the data points within this tube are considered to have zero error from the regression line. Only the points outside this tube are considered when calculating the errors. The error is calculated as the distance from the data point to the boundary of the tube, rather than from the data point to the regression line (as in linear regression).
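The difference between the two error definitions can be sketched in a few lines of Python. This is only an illustration: the data values and the choice of epsilon = 0.1 are made up, and the function names are my own, not part of any library.

```python
import numpy as np

def squared_error(y_true, y_pred):
    # Linear regression style: every point contributes, however close it is
    return (y_true - y_pred) ** 2

def epsilon_insensitive_error(y_true, y_pred, epsilon):
    # SVR style: zero inside the tube, distance to the tube boundary outside it
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Made-up values, purely for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.05, 2.5, 3.0, 2.0])
epsilon = 0.1  # hypothetical half-width of the tube

squared_errors = squared_error(y_true, y_pred)
tube_errors = epsilon_insensitive_error(y_true, y_pred, epsilon)

print(squared_errors)
print(tube_errors)
```

The first point misses by only 0.05, which is inside the tube, so its tube error is zero even though its squared error is not.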
Why support vector?
Well, all the points outside the tube are known as slack points (strictly speaking, points lying exactly on the tube’s boundary count as support vectors too), and they are essentially vectors in a 2-dimensional space. Imagine drawing vectors from the origin to the individual slack points; then you can see all the vectors in the graph. These vectors support the structure, or formation, of this tube, and hence the method is known as support vector regression. You can see this in the graph below.
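If you want to check this idea on a fitted model, scikit-learn exposes the indices of the support vectors through the `support_` attribute of `SVR`. The sketch below uses synthetic noisy sine data and an epsilon of 0.3, both of which are my own assumptions and not part of this article's dataset. It compares each point's distance from the fitted curve with epsilon.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 5, 40).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.2, size=40)

epsilon = 0.3  # hypothetical tube half-width
model = SVR(kernel='rbf', epsilon=epsilon).fit(X_demo, y_demo)

# Vertical distance of every point from the fitted curve
residuals = np.abs(y_demo - model.predict(X_demo))

# Boolean mask marking which training points are support vectors
is_support = np.zeros(len(X_demo), dtype=bool)
is_support[model.support_] = True

print('support vectors:', is_support.sum(), 'of', len(X_demo))
print('largest residual among the other points:', residuals[~is_support].max())
```

Up to the solver's numerical tolerance, the points that are not support vectors should all lie inside the tube, while the support vectors sit on its boundary or outside it.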
Let us dive into Python, build a support vector regression model, and try to predict the salary of an employee at a (hypothetical) level of 6.5.
Before you move forward, please download the CSV data file from my GitHub Gist.
Once you open the link, you will find a "Download ZIP" button in the top right corner of the window. Go ahead and download the files.
You can download 1) the Python file and 2) the data file (.csv).
Rename the folder accordingly and store it in the desired location, and you are all set. If you are a beginner, I highly recommend that you open your Python IDE and follow the steps below, because here I write detailed comments (the statements after #; these are not executed when you run the code) on the working of the code. You can use the actual Python file as your backup or for future reference.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Import Data and Define the X and Y variables
dataset = pd.read_csv('/Users/tharunpeddisetty/Desktop/Position_Salaries.csv') # add your file path
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
# iloc takes the values from the specified index locations and stores them in the assigned variable as an array
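To see what these two iloc slices return, here is a small sketch on a made-up table. The column names and values are assumptions for illustration only; they are not taken from the CSV file.

```python
import pandas as pd

# Hypothetical three-column layout (label, level, salary); values are made up
toy = pd.DataFrame({
    'Position': ['Analyst', 'Manager', 'Director'],
    'Level': [1, 2, 3],
    'Salary': [40000, 60000, 90000],
})

# All rows, every column from index 1 up to (but not including) the last one
X_toy = toy.iloc[:, 1:-1].values
# All rows, last column only
y_toy = toy.iloc[:, -1].values

print(X_toy.shape)  # 2D: one row per employee, one column per feature
print(y_toy.shape)  # 1D: one value per employee
```

Note that the range slice keeps X two-dimensional while the single-index slice makes y one-dimensional, which is why y has to be reshaped before it is passed to the scaler.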
Let us look at our data and understand the variables:
This data depicts the position/level of the employee and their salaries. This is the same dataset that I used in my Decision Tree Regression article.
# Feature scaling. Required for SVR, since there is no concept of coefficients
# We need to reshape y because the StandardScaler class expects a 2D array
y = y.reshape(len(y), 1)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
# Create a second scaler object because the first one calculates the mean and standard deviation of X. We need different values of mean and standard deviation for y
# Fit each scaler and replace the data with its standardized version (the inverse_transform calls later on rely on this)
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
SVR has no concept of coefficients as linear regression does, so to reduce the effect of high-valued features we need to scale the features, in other words bring all the values onto one scale. We achieve this by standardizing the values. We have only one feature in this example, but we apply scaling to it anyway, using the StandardScaler class from sklearn. For other datasets, do not forget to scale all your features and the dependent variable. Also remember to reshape y (the dependent variable, i.e. Salary); this is purely so that it can be passed through the standard scaler, which expects a 2D array.
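Standardizing simply means subtracting the mean and dividing by the standard deviation, column by column. The sketch below does this by hand with NumPy on made-up levels, so the numbers are assumptions for illustration. It uses the population standard deviation (ddof=0), which is what StandardScaler uses as well.

```python
import numpy as np

# Made-up levels, one feature column
levels = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

mean = levels.mean(axis=0)
std = levels.std(axis=0)  # population standard deviation (ddof=0)

# Forward transform: the scaled column has mean 0 and standard deviation 1
scaled = (levels - mean) / std

# Inverse transform: back to the original units
restored = scaled * std + mean

# A new value must be scaled with the SAME mean and std as the training data
new_scaled = (np.array([[6.5]]) - mean) / std

print(scaled.ravel())
print(new_scaled)
```

This is also why X and y each need their own scaler object: each one stores the mean and standard deviation of the column it was fitted on.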
Training the SVR model
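from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y.ravel()) # train on the scaled data; ravel() flattens y back to the 1D shape that fit expects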
Simple, isn’t it? We are going to use the Radial Basis Function as the kernel inside the SVR algorithm. This means that we are using a function called ‘rbf’ to map the data from one space to another. Explaining how this works is outside the scope of this article, but you can always research it online. The right choice of kernel function depends on the distribution of the data. I suggest you read up on the other kernels after implementing this basic program in Python.
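For the curious, the function behind the name is short enough to write down. The sketch below is my own hand-written version of the usual RBF formula, exp(-gamma * squared distance), with a gamma of 0.5 chosen arbitrarily; it is not the code that sklearn runs internally.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    # Similarity between two points: 1 when they coincide, approaching 0 as they move apart
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

gamma = 0.5  # hypothetical value, chosen only for illustration

same = rbf_kernel([1.0], [1.0], gamma)
near = rbf_kernel([1.0], [1.5], gamma)
far = rbf_kernel([1.0], [4.0], gamma)

print(same, near, far)
```

Because the kernel depends on distances between points, features on very different scales would distort it, which is one more reason to standardize the data first.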
Visualizing the results of SVR
X_orig = sc_X.inverse_transform(X) # levels back on their original scale
X_grid = np.arange(X_orig.min(), X_orig.max(), 0.1) # a finer grid gives a smoother curve
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X_orig, sc_y.inverse_transform(y), color = 'red')
# predict returns a 1D array, so reshape it to 2D before inverse transforming
plt.plot(X_grid, sc_y.inverse_transform(regressor.predict(sc_X.transform(X_grid)).reshape(-1, 1)), color = 'blue')
plt.title('Support Vector Regression')
plt.show()
You can see how this model fits the data. Do you think it is doing a great job? Compare these results with those of the other regressions that I implemented on the same data in my previous articles to see the difference, or wait until the end of this article.
Predicting 6.5 level result using Support Vector Regression
sc_y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])).reshape(-1, 1))
# We also need to inverse transform in order to get the final result
Make sure you apply the same transformations to the new input as you applied to the initial data, so that the model receives values on the scale it was trained on and produces relevant results.
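The whole round trip (scale, fit, scale the new input, predict, inverse transform) can be sketched end to end as follows. The levels and salaries here are made up so that the block runs without the CSV file, which means the number it prints is not the result reported in this article.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Made-up data (an assumption for illustration): levels 1 to 10, salary growing with level
X_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_demo = X_demo ** 2 * 1000.0  # already 2D, as the scaler expects

# One scaler per variable, fitted on the training data only
sc_X_demo, sc_y_demo = StandardScaler(), StandardScaler()
X_scaled = sc_X_demo.fit_transform(X_demo)
y_scaled = sc_y_demo.fit_transform(y_demo)

regressor_demo = SVR(kernel='rbf')
regressor_demo.fit(X_scaled, y_scaled.ravel())

# 1) scale the new input with the scaler fitted on X
new_level = sc_X_demo.transform(np.array([[6.5]]))
# 2) predict on the scaled input (the result is on the scaled salary axis)
scaled_pred = regressor_demo.predict(new_level)
# 3) convert the prediction back to salary units
salary_pred = sc_y_demo.inverse_transform(scaled_pred.reshape(-1, 1))

print(salary_pred)
```

Skipping step 1 would feed the model a raw 6.5 where it expects a standardized value, and skipping step 3 would leave the answer on the standardized scale rather than in salary units.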
Let me summarize all the results from various regression models so that it is easier for our comparison.
Support Vector Regression: 170370.0204065
Random Forest Regression: 167000 (Output is not part of the code)
Decision Tree Regression: 150000 (Output is not part of the code)
Polynomial Linear Regression: 158862.45 (Output is not part of the code)
Linear Regression: 330378.79 (Output is not part of the code)
You have the data in front of you. Now, act as a manager and make the decision yourself. How much salary would you give an employee at level 6.5 (consider the level to be the years of experience)? You see, there’s no absolute answer in data science. I cannot say that SVR performed better than the others and is therefore the best model for predicting salaries. If you ask what I think, I feel the prediction of random forest regression is more realistic than that of SVR. But again, that is my feeling. Remember that a lot of factors come into play, such as the position of the employee, the average salary in that region for that position, and the employee’s previous salary. So, don’t even believe me if I say the random forest result is the best one. I only said that it is more realistic than the others. The final decision depends on the business case of the organization, and there is by no means a perfect model that predicts an employee’s salary perfectly.
Congratulations! You have implemented support vector regression in a minimal number of lines of code. You now have a template of the code, and you can apply it to other datasets and observe the results. This marks the end of my articles on regression. The next stop is classification models. Thanks for reading. Happy Machine Learning!