Take a good look at the diagram. Done? Perfect. Now let's understand this step by step:
- ETL — Extract, Transform, and Load
This is where most data scientists spend their time: cleaning data. Everything necessary to form a good dataset before we feed it to our machine learning algorithm, from creating data frames to feature engineering, falls under this step. If you want to start slow and learn how to implement basic ETL, head over to this article, where we perform ETL and predict who survived the Titanic.
To obtain the best results, we first have to train our model on data, so that the next time it sees something similar, it knows what it is looking at. This is the stage where your model's training and tuning take place.
We then put our model into operation so it can respond to user queries, after going through several processes such as ranking: based on the user's query we rank the results and deliver them back to the user. Think of how Google presents a new set of results for each new query. Neat, huh?
To avoid any misconception here, know that this is just a general idea of how an end-to-end workflow works. There is a lot of work behind the scenes. Just like a movie, or a play, or an opera.
The need for High-Performance Computing
You might be wondering why we need to speed this process up. Let me briefly tell you why.
All three stages that we discussed above have their own set of computational challenges. With an enormous amount of data generated every day and high processing requirements (terabyte-scale datasets), let's just say CPUs are simply not enough.
Data scientists spend most of their time cleaning and processing data (aka ETL), and no one wants that process to take half a day, or an entire day. I know I wouldn't.
And with companies quickly moving towards acquiring and processing petabytes of data, we need speed now more than ever. Let’s understand that better with an example.
Understanding the data
First things first: you should know your data better than you know yourself. The NYC taxi dataset is popular and widely available; you can download it from here. It contains 9 files for 9 different months starting from January 2014, and each file is approximately 2–2.2 GB in size. Yes, each file. You probably get the idea. We will be processing all the files together.
First, let’s see what’s in those files. There are 17 columns or features in the dataset. I have listed each one of them, and they are self-explanatory.
Simple, right? Good. Now, what about the records, you might be wondering? We have a whopping 124M rows, or records, across all 9 files combined. Yes, that's right: 124 million, or 124,649,497 to be precise. Any guess how long the ETL process will take? Stay with me to find out.
We will be using cuDF with Dask and XGBoost to scale GPU DataFrame ETL-style operations and model training. To know more about cuDF, read fellow communicator George's post, "How to Use CuPy to Make Numpy Over 10X Faster". For the sake of simplicity, I am going to keep the code in this post minimal, but you can find the entire .ipynb notebook and many other end-to-end example notebooks on RAPIDS' GitHub repository here.
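Loading the data is the first step. cuDF deliberately mirrors the pandas API, so the sketch below uses pandas for portability; in the actual workflow the same `read_csv` call goes through `dask_cudf` to spread the nine files across GPUs. The file path and the inline sample rows here are hypothetical, just to make the sketch self-contained.

```python
import io
import pandas as pd  # cudf and dask_cudf expose the same read_csv interface

# In the real workflow this would read the nine ~2 GB monthly files, e.g.:
#   import dask_cudf
#   df = dask_cudf.read_csv("nyc-taxi/2014/*.csv")
# Here we parse a tiny inline sample instead (hypothetical rows).
sample = io.StringIO(
    "pickup_datetime,passenger_count,fare_amount\n"
    "2014-01-03 08:15:00,1,9.5\n"
    "2014-01-25 19:40:00,2,23.0\n"
)
df = pd.read_csv(sample, parse_dates=["pickup_datetime"])
print(len(df), list(df.columns))
```

Parsing `pickup_datetime` up front matters later, when we split the month into training and test days.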
As we already know, the data needs cleaning, so let's tidy up our dataset.
The same columns are named differently in different CSV files. For example, one file has rate_code and another has RateCodeID; while both are acceptable ways to name a column, we have to pick one. I always choose the first one, since the underscore separates the words and makes them easier on my eyes. Forever team lazy.
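Normalizing the names is a one-liner. A minimal sketch, shown with pandas since cuDF supports the same `rename` call; the mapping below is hypothetical beyond the rate_code example mentioned above:

```python
import pandas as pd

# Hypothetical mapping from one file's CamelCase names to the
# snake_case names used in the rest of the pipeline.
remap = {"RateCodeID": "rate_code", "VendorID": "vendor_id"}

df = pd.DataFrame({"RateCodeID": [1, 1, 2], "VendorID": [2, 1, 2]})
df = df.rename(columns=remap)  # cuDF DataFrames accept the same call
print(list(df.columns))  # → ['rate_code', 'vendor_id']
```

Applying one such mapping per mismatched file gives every DataFrame an identical schema before concatenation.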
Outliers are always there, ready to break something or stop it from running efficiently, and we have to handle them. For example, fares less than $0 or greater than $500: who would pay $500 for a taxi ride? The same goes for passenger count: we discard entries with fewer than 0 or more than 6 passengers.
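The filtering itself is a boolean mask over the thresholds just described. A minimal sketch with pandas (cuDF supports identical boolean indexing); the sample rows are made up for illustration:

```python
import pandas as pd

# Hypothetical sample rows, including the kinds of outliers described above.
df = pd.DataFrame({
    "fare_amount": [9.5, -4.0, 650.0, 52.0],
    "passenger_count": [1, 2, 1, 8],
})

# Keep fares in [$0, $500] and passenger counts in [0, 6],
# matching the cut-offs discussed in the text.
mask = (
    (df["fare_amount"] >= 0) & (df["fare_amount"] <= 500)
    & (df["passenger_count"] >= 0) & (df["passenger_count"] <= 6)
)
clean = df[mask]
print(len(clean))  # → 1 (only the $9.50, one-passenger trip survives)
```

On the full dataset this is the step that drops the roughly 7 million bad records mentioned below.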
After cleaning up the dataset we have discarded almost 7 million records and now have 117 million from which we can actually gain insights.
Picking a training set
Let's imagine you are going to take a trip to New York (don't actually go, just imagine) on the 25th, and want to build a model that predicts what fare prices will be like in the last few days of the month, given data from the first part of the month.
We measure how long it takes for the cluster to load the data from the storage bucket and run the ETL portion of the workflow. At this point we have 92M records for training and the remaining 25M for testing.
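The split described above is just a filter on the day of the pickup timestamp: trips before the 25th train the model, the rest test it. A minimal sketch with pandas (cuDF's datetime accessor works the same way); the sample trips are hypothetical:

```python
import pandas as pd

# Hypothetical sample of cleaned trips from a single month.
df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2014-01-03 08:15", "2014-01-12 19:40", "2014-01-27 07:05",
    ]),
    "fare_amount": [9.5, 23.0, 14.0],
})

day = df["pickup_datetime"].dt.day
train = df[day < 25]   # first part of the month
test = df[day >= 25]   # the late-month trips we want to predict
print(len(train), len(test))  # → 2 1
```

A date-based split like this respects the forecasting setup: the model never sees trips from the days it is asked to predict.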
The wall time represents the total time: 2 min 3 s, or 123 seconds, on the CPU. Impressive, huh? But what if I told you we can achieve even faster results?
Running the same process after enabling a single GPU (NVIDIA TITAN RTX, 24 GB GDDR6), we can finish the same job in 63 seconds: nearly a 2x speedup. If you think that is great, wait till I show you the speedup on training time. I was also monitoring CPU usage while doing so; seeing all cores working was soothing to my eyes, not going to lie.