
Automating Random Forests

A tutorial to set up your own automated machine learning system.

Shaan Shah

I recently finished developing a website that performs end-to-end machine learning through a GUI, i.e. it does the following steps automatically:

  • Takes the training data and the test data from the user via a form.
  • Cleans up the data and makes it usable for a machine learning model (e.g. filling in missing values, dealing with categorical variables, etc.).
  • Trains a Random Forest on the data and tunes its hyperparameters.
  • Performs feature engineering on the data.
  • Makes a feature importance plot for the final model.
  • Generates predictions on the test data and sends out the results and the feature importance plot to the email address provided by the user.

In this blog, I will take you through the code to set up such a system. You can find the GitHub repo containing the core code for the machine learning part of the website here.

We will be using parts of the old fastai library (v0.7) as an initial base (the code up to line 500 contains snippets from the library and other necessary imports that we will use). You can either copy the snippets or install the library. We will begin coding from here!
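
If you install the library rather than copying the snippets, the base would look roughly like this. This is a minimal sketch; the exact import list depends on which snippets you use, and the sketches further down build on these names.

```python
# Minimal base, assuming fastai v0.7 (pip install fastai==0.7.0).
# If you copy the snippets instead, these helpers come from the
# first ~500 lines of the repo rather than from the library.
import math
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from fastai.structured import (add_datepart, train_cats, apply_cats,
                               proc_df, set_rf_samples)
```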

I will be explaining the code piece by piece. So let’s get started!

The above code defines some functions which will be used repeatedly as we move forward. Two functions stand out the most:

  • “print_score”: It will be used to evaluate our model’s performance on the training and validation datasets.
  • “auto_train”: It will be used to train a random forest using given hyperparameters.
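
The repo has the exact definitions; as a rough sketch (the RMSE metric, hyperparameter defaults, and signatures here are my assumptions, modeled on the usual fastai-course helpers), they might look like this:

```python
def rmse(preds, actuals):
    # Root mean squared error.
    return math.sqrt(((preds - actuals) ** 2).mean())

def print_score(m, X_train, y_train, X_valid, y_valid):
    # Report [train RMSE, valid RMSE, train R^2, valid R^2].
    print([rmse(m.predict(X_train), y_train),
           rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train),
           m.score(X_valid, y_valid)])

def auto_train(X, y, min_samples_leaf=1, max_features=0.5):
    # Train a random forest with the given hyperparameters.
    m = RandomForestRegressor(n_estimators=40,
                              min_samples_leaf=min_samples_leaf,
                              max_features=max_features,
                              n_jobs=-1)
    m.fit(X, y)
    return m
```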

The next function, “data_trainer”, is a bit long, so I will break it into two parts to explain it. It will be used to perform the following tasks:

  • Clean up the data, deal with the categorical variables, extract information from dates (if there are any) and fill up the missing values.
  • Train a random forest and tune its hyperparameters.
  • Perform feature engineering.
  • Remove redundant variables.
  • Plot the feature importance graph.

So let’s dive into it!
I have put up comments to demarcate the sections corresponding to each process.

The above code (the first part of “data_trainer”) does the following tasks:

  1. It extracts information from the date column (if there is one), such as the year, month, day, whether it is a quarter end, whether it is a year end, and much more.
  2. It converts the categorical variables into a format that the machine learning model can use. It also fills in the missing values.
  3. It splits the data into training and validation datasets.
  4. If the dataset is very large, it will use “rf_sampling” (a fastai method to speed things up).
  5. It then tunes the hyperparameters of the random forest, namely “min_samples_leaf” and “max_features”, using loops to get the best results.
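
To make those five steps concrete, here is a hedged sketch of the first half, using the fastai v0.7 helpers add_datepart, train_cats, proc_df, and set_rf_samples (the split ratio, size threshold, and grid values are illustrative, not the repo's exact choices):

```python
def data_trainer(df, target_col, date_col=None):
    # 1. Extract date parts (Year, Month, Day, Is_quarter_end, ...).
    if date_col is not None:
        add_datepart(df, date_col)

    # 2. Numericalize categoricals; fill missing values with medians.
    train_cats(df)
    X, y, nas = proc_df(df, target_col)

    # 3. Hold out the last 20% of rows as a validation set.
    n_train = int(0.8 * len(X))
    X_train, X_valid = X[:n_train], X[n_train:]
    y_train, y_valid = y[:n_train], y[n_train:]

    # 4. On very large datasets, grow each tree on a subsample.
    if len(X_train) > 100_000:
        set_rf_samples(50_000)

    # 5. Tune min_samples_leaf and max_features with simple loops.
    best_score, best_params = float('-inf'), None
    for msl in [1, 3, 5, 10, 25]:
        for mf in [0.5, 'sqrt', 'log2']:
            m = auto_train(X_train, y_train, msl, mf)
            score = m.score(X_valid, y_valid)
            if score > best_score:
                best_score, best_params = score, (msl, mf)
    return X_train, y_train, X_valid, y_valid, nas, best_params
```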

The above code (the second part of “data_trainer”) performs feature engineering and removes the variables that do not affect our target variable or are redundant.
After that, the function returns the optimum hyperparameters and a list of features to be used for training the final model.
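
In the repo this logic lives inside “data_trainer” itself; here it is split into a helper for readability. This is a sketch of the idea, assuming redundant features are identified by their importance scores (the 0.005 threshold is illustrative):

```python
def rf_feat_importance(m, df):
    # Pair each column with the forest's importance score.
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                        ).sort_values('imp', ascending=False)

def select_features(m, X_train, y_train, X_valid, y_valid, params,
                    threshold=0.005):
    # Keep only features whose importance clears the threshold.
    fi = rf_feat_importance(m, X_train)
    to_keep = fi[fi.imp > threshold].cols.tolist()

    # Retrain on the reduced set; keep it only if the validation
    # score does not degrade.
    m2 = auto_train(X_train[to_keep], y_train, *params)
    if m2.score(X_valid[to_keep], y_valid) >= m.score(X_valid, y_valid):
        return to_keep
    return list(X_train.columns)
```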

The above code defines a function to generate predictions on the test dataset and to create and save the feature importance plot. It performs all of the following:

  • Extracts information from the date column (if any), as we did for the training dataset.
  • Trains a Random Forest (using the hyperparameters and feature engineering information obtained in the previous function) on the whole dataset (training + validation) and applies it to the test dataset to generate predictions.
  • Generates the feature importance plot and saves it.
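
Building on the helpers above, here is a rough sketch of “auto_applyer” (the signature and matplotlib usage are my assumptions; apply_cats and the na_dict returned by proc_df ensure the test set is encoded exactly like the training set):

```python
import matplotlib.pyplot as plt

def auto_applyer(df_train, df_test, target_col, params, to_keep,
                 date_col=None, plot_path='feature_importance.png'):
    # Extract date parts from both sets, as we did during training.
    if date_col is not None:
        add_datepart(df_train, date_col)
        add_datepart(df_test, date_col)

    # Encode the test set with the training set's category codes and
    # fill its missing values with the training medians (na_dict).
    train_cats(df_train)
    apply_cats(df_test, df_train)
    X, y, nas = proc_df(df_train, target_col)
    X_test, _, _ = proc_df(df_test, na_dict=nas)

    # Train on all available data (training + validation) and predict.
    m = auto_train(X[to_keep], y, *params)
    preds = m.predict(X_test[to_keep])

    # Generate and save the feature importance plot.
    fi = rf_feat_importance(m, X[to_keep])
    fi.plot(x='cols', y='imp', kind='barh', figsize=(10, 8), legend=False)
    plt.tight_layout()
    plt.savefig(plot_path)
    return preds
```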

Now all that is left is to tie the previously defined functions together by calling them in an appropriate sequence with the appropriate inputs, which is what is done below.
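
Hedged like the rest, the glue might look roughly like this (function names match the sketches above; the CSV-based interface is an assumption):

```python
def auto_predictor(train_csv, test_csv, target_col, date_col=None):
    # Run the whole pipeline: clean, tune, select features, predict.
    df_train = pd.read_csv(train_csv)
    df_test = pd.read_csv(test_csv)

    # Clean the data and tune the hyperparameters on a held-out split.
    X_tr, y_tr, X_val, y_val, nas, params = data_trainer(
        df_train.copy(), target_col, date_col)

    # Drop redundant features, then train the final model and predict.
    m = auto_train(X_tr, y_tr, *params)
    to_keep = select_features(m, X_tr, y_tr, X_val, y_val, params)
    return auto_applyer(df_train, df_test, target_col, params, to_keep,
                        date_col)
```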

To summarize:
We first defined a function to clean up the training data, find the optimum hyperparameters, and perform feature engineering (called “data_trainer”). We then defined a function that uses this information to train a model and generate a feature importance plot and predictions for the test dataset (called “auto_applyer”). Finally, we tied everything together with a function called “auto_predictor”. Also, if you want to set up an emailing system to send out the results, you can find the code here.
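
For completeness, here is a bare-bones emailing sketch using Python's standard smtplib and email modules (the sender address, password, and SMTP host are placeholders; the actual code linked above may differ):

```python
import smtplib
from email.message import EmailMessage

def send_results(to_addr, csv_path, plot_path,
                 from_addr='you@example.com', password='app-password'):
    msg = EmailMessage()
    msg['Subject'] = 'Your predictions are ready'
    msg['From'], msg['To'] = from_addr, to_addr
    msg.set_content('Attached are your predictions and the feature '
                    'importance plot.')

    # Attach the predictions CSV and the saved plot.
    with open(csv_path, 'rb') as f:
        msg.add_attachment(f.read(), maintype='text', subtype='csv',
                           filename='predictions.csv')
    with open(plot_path, 'rb') as f:
        msg.add_attachment(f.read(), maintype='image', subtype='png',
                           filename='feature_importance.png')

    # Placeholder SMTP settings; use your provider's host and port.
    with smtplib.SMTP_SSL('smtp.gmail.com', 465) as server:
        server.login(from_addr, password)
        server.send_message(msg)
```
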
Voila! We have set up our own automated machine learning system.

You can check out the website (the end-to-end GUI) here and its GitHub repo here.

Thanks a lot for reading this blog!

P.S.
Please feel free to connect with me on LinkedIn for any questions or suggestions.