Once you understand your data, a majority of your time spent as a data scientist is on this step, data preprocessing. This is when you spend your time manipulating the data so that it can be modeled properly. Like I said before, there is no universal way to go about this. HOWEVER, there are a number of essential things that you should consider which we’ll go through below.
Feature Imputation is the process of filling missing values. This is important because most machine learning models don’t work when there are missing data in the dataset.
One of the main reasons that I wanted to write this guide is specifically for this step. Many articles say that you should default to filling missing values with the mean or simply remove the row, and this is not necessarily true.
Ideally, you want to choose the method that makes the most sense — for example, if you were modeling people’s age and income, it wouldn’t make sense for a 14-year-old to be making a national average salary.
All things considered, there are a number of ways you can deal with missing values:
- Single value imputation: replacing missing values with the mean, median, or mode of a column
- Multiple value imputation: modeling features that have missing data and imputing missing data with what your model finds.
- K-Nearest neighbor: filling data with a value from another similar sample
- Deleting the row: this isn’t an imputation technique, but it tends to be okay when there’s an extremely large sample size where you can afford to.
- Others include: random imputation, moving window, most frequent, etc…
Feature encoding is the process of turning values (i.e. strings) into numbers. This is because a machine learning model requires all values to be numbers.
There are a few ways that you can go about this:
- Label Encoding: Label encoding simply converts a feature’s non-numerical values into numerical values, whether the feature is ordinal or not. For example, if a feature called car_colour had distinct values of red, green, and blue, then label encoding would convert these values to 1, 2, and 3 respectively. Be wary when using this method because while some ML models will be able to make sense of the encoding, others won’t.
- One Hot Encoding (aka. get_dummies): One hot encoding works by creating a binary feature (1, 0) for each non-numerical value of a given feature. Reusing the example above, if we had a feature called car_colour, then one hot encoding would create three features called car_colour_red, car_colour_green, car_colour_blue, and would have a 1 or 0 indicating whether it is or isn’t.
When numerical values are on different scales, eg. height in centimeters and weight in lbs, most machine learning algorithms don’t perform well. The k-nearest neighbors algorithm is a prime example where features with different scales do not work well. Thus normalizing or standardizing the data can help with this problem.
- Feature normalization rescales the values so that they’re within a range of [0,1]/
- Feature standardization rescales the data to have a mean of 0 and a standard deviation of one.
Feature engineering is the process of transforming raw data into features that better represent the underlying problem that one is trying to solve. There’s no specific way to go about this step but here are some things that you can consider:
- Converting a DateTime variable to extract just the day of the week, the month of the year, etc…
- Creating bins or buckets for a variable. (eg. for a height variable, can have 100–149cm, 150–199cm, 200–249cm, etc.)
- Combining multiple features and/or values to create a new one. For example, one of the most accurate models for the titanic challenge engineered a new variable called “Is_women_or_child” which was True if the person was a woman or a child and false otherwise.
Next is feature selection, which is choosing the most relevant/valuable features of your dataset. There are a few methods that I like to use that you can leverage to help you with selecting your features:
- Feature importance: some algorithms like random forests or XGBoost allow you to determine which features were the most “important” in predicting the target variable’s value. By quickly creating one of these models and conducting feature importance, you’ll get an understanding of which variables are more useful than others.
- Dimensionality reduction: One of the most common dimensionality reduction techniques, Principal Component Analysis (PCA) takes a large number of features and uses linear algebra to reduce them to fewer features.
Dealing with Data Imbalances
One other thing that you’ll want to consider is data imbalances. For example, if there are 5,000 examples of one class (eg. not fraudulent) but only 50 examples for another class (eg. fraudulent), then you’ll want to consider one of a few things:
- Collecting more data — this always works in your favor but is usually not possible or too expensive.
- You can over or undersample the data using the scikit-learn-contrib Python package.