Another easy and surprisingly effective way to boost a model’s accuracy is to remove outliers from your data. There are several ways to go about this: one common approach uses the z-score, and another removes any values above the third quartile of your data. The reason we do this is that outliers skew the summary statistics of the data, like the mean and standard deviation. A model that relies on values like these can end up consistently predicting too low or too high, depending on where your stray data sits.
One simple way to handle outliers like this is to replace the problematic values with the mean. To start, we will compute the third quartile of the data and treat anything above it as problematic. Alternatively, you could use the mean of only the values above the overall mean as the replacement.
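As a minimal sketch of the quartile approach just described, the snippet below uses a hypothetical single-column DataFrame, computes the third quartile with pandas, and swaps out anything above it for the column mean:

```python
import pandas as pd

# Toy data with one extreme value (hypothetical example)
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})

q3 = df["x"].quantile(0.75)   # third quartile of the column
mean = df["x"].mean()         # replacement value

# Keep values at or below the third quartile; replace the rest with the mean
df["x"] = df["x"].where(df["x"] <= q3, mean)
```

Note that the mean itself is inflated by the outlier before replacement, so on very skewed data you may prefer the median as the replacement value.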
For this example, we will be switching over to Python. I am using both languages in this article because they are rather similar, and using them in conjunction makes the material accessible to developers on both sides. Consider the following DataFrame:
If we were to fit a model to this data, whichever column we chose as our feature would contain some pretty radical outliers, which would certainly cause problems for us down the line. To remove them, we will grab zscore from scipy.stats and build a conditional mask that filters out rows lying far from the center of our distribution:
from scipy import stats
import numpy as np
z_scores = stats.zscore(df)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 2).all(axis=1)
new_df = df[filtered_entries]
Now if we show new_df, we can see that our outliers are gone!
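Since the original DataFrame is not shown here, the following self-contained run-through uses a small made-up DataFrame (the column names and values are my own, not from the article) to demonstrate the same z-score filter end to end:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the article's DataFrame:
# the last row is a clear outlier in both columns
df = pd.DataFrame({
    "a": [10, 11, 9, 10, 12, 95],
    "b": [5, 6, 5, 7, 6, 40],
})

z_scores = stats.zscore(df)                      # per-column z-scores
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 2).all(axis=1)  # keep rows within 2 sigma everywhere
new_df = df[filtered_entries]
```

After filtering, the outlier row (95, 40) is gone and the remaining rows are untouched.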
Another great way to boost your accuracy by modifying data is to remove bad features. It’s all too common: you have the right model, but not the right data. Features that seem important are not necessarily so. Some features might not be improving your score at all; in fact, they could even be making your accuracy worse. Bad features are certainly something to watch out for!
But how can you avoid them?
The best way to avoid using a bad feature in your model is statistical testing: measure how statistically significant each feature is with respect to your target. Another great option is to compute feature importances, which can be harder to obtain in certain situations, but when it is available it is often a much easier route than statistical testing.
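Both approaches can be sketched with scikit-learn; the dataset below is synthetic and the parameter choices (k=2, the random seeds) are illustrative assumptions, not values from the article. SelectKBest with f_classif scores each feature with an ANOVA F-test, while a random forest exposes importances directly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 5 features, only 2 of them actually informative
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=2,
    n_redundant=0, random_state=0,
)

# Statistical testing: keep the k features with the best F-scores
selector = SelectKBest(f_classif, k=2).fit(X, y)
print("F-scores:", selector.scores_)
print("kept columns:", selector.get_support(indices=True))

# Feature importances from a tree ensemble as an alternative signal
forest = RandomForestClassifier(random_state=0).fit(X, y)
print("importances:", forest.feature_importances_)
```

Features with tiny F-scores or near-zero importances are the candidates to drop before refitting your model.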