Accelerating end-to-end Machine Learning workflows with NVIDIA RAPIDS

Take a good look at the diagram. Done? Perfect. Now let’s understand it step by step:

  1. ETL — Extract, Transform, and Load
    This is where most data scientists spend their time: cleaning data. In this step we do everything necessary to form a good dataset before we feed it to our machine learning algorithm; everything from creating data frames to feature engineering falls under ETL. If you want to start slow and learn how to implement basic ETL, head over to this article where we perform ETL and predict who survived the Titanic.
  2. Training
    To obtain the best results, we first have to train our model with data, so that the next time it sees something similar, it knows what it is looking at. This is the stage where your model’s training and tuning take place.
  3. Inference
    We then put our model into operation so it can respond to user queries, usually after a few more processing steps such as ranking: based on the user’s query we rank the results and deliver them back to the user. Think of how Google presents you a new set of results for each new query. Neat, huh?

To avoid any misconception here, know that this is just a general idea of how an end-to-end workflow works. There is a lot of work behind the scenes. Just like a movie, a play, or an opera.
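To make the three stages a bit more concrete before we dive in, here is a minimal sketch using the same tools this post builds on (cuDF for ETL, XGBoost for training). The file name, column names, and model settings below are purely illustrative assumptions, not the exact setup used later.

```python
# Illustrative only: a tiny end-to-end sketch (ETL -> training -> inference).
# "trips.csv" and the column names are hypothetical stand-ins.
import cudf
import xgboost as xgb

# 1. ETL: load and clean the raw data on the GPU
df = cudf.read_csv("trips.csv")
df = df.dropna()
df = df[df["fare_amount"] > 0]

# 2. Training: fit a model on the prepared features
X = df.drop(columns=["fare_amount"])
y = df["fare_amount"]
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(
    {"objective": "reg:squarederror", "tree_method": "gpu_hist"},
    dtrain,
)

# 3. Inference: answer new queries with the trained model
predictions = booster.predict(dtrain)
```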

The need for High-Performance Computing

You might be wondering why we need to speed this process up. Let me briefly tell you why.

All three stages that we discussed above have their own set of challenges when it comes to computation. With an enormous amount of data generated every day and heavy processing requirements (terabytes of data), let’s just say CPUs are simply not enough.

Data scientists spend most of their time cleaning and processing (aka ETL), and no one wants that process to take half a day or an entire day. I know, I wouldn’t.

And with companies quickly moving towards acquiring and processing petabytes of data, we need speed now more than ever. Let’s understand that better with an example.

Understanding the data

First things first, you should know your data better than you know yourself. The NYC taxi dataset is popular and widely available; you can download it from here. It contains 9 files for 9 different months, starting from January 2014, and each file is approximately 2–2.2 GB in size. Yes, each file. You probably get the idea. We will be processing all the files together.

First, let’s see what’s in those files. There are 17 columns or features in the dataset. I have listed each one of them, and they are self-explanatory.

vendor_id
pickup_datetime
dropoff_datetime
passenger_count
trip_distance
pickup_latitude
pickup_longitude
rate_code
dropoff_latitude
dropoff_longitude
payment_type
fare_amount
surcharge
mta_tax
tip_amount
tolls_amount
total_amount

Simple, right? Good. Now, what about the records, you might be wondering? We have a whopping 124 million rows, or records, across all 9 files combined. Yes, that’s right, 124 million; 124,649,497 to be precise. Any guess how long the ETL process will take? Stay with me to find out.

Finding the size of data

We will be using cuDF with Dask and XGBoost to scale GPU DataFrame ETL-style operations and to train our model. To learn more about cuDF, read fellow communicator George’s post on “How to Use CuPy to Make Numpy Over 10X Faster”. For the sake of simplicity, I am going to keep the code in this post minimal, but you can find the entire ipynb notebook and many other E2E example notebooks in RAPIDS’ GitHub repository here.
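As a rough sketch of what that setup looks like, the snippet below spins up a local Dask-CUDA cluster and reads all nine monthly CSVs into one distributed GPU DataFrame. The file path and glob pattern are assumptions about where you downloaded the data; adjust them to your own storage.

```python
# A sketch of loading the nine monthly files with Dask-cuDF.
# The path/pattern below is an assumption about your local layout.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

cluster = LocalCUDACluster()   # one Dask worker per visible GPU
client = Client(cluster)

# Read every monthly CSV into a single distributed GPU DataFrame
taxi_df = dask_cudf.read_csv("data/nyc_taxi/2014/yellow_tripdata_2014-0*.csv")

print(taxi_df.npartitions)     # how many partitions the data was split into
print(len(taxi_df))            # ~124.6 million rows across the 9 files
```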

Data Cleanup

As we already know, we need to clean the data, so let’s tidy up our dataset.

We have the same columns named differently in different CSV files. For example, one file has rate_code and another has RateCodeID. Both are perfectly valid column names, but we have to standardize on one. I always choose the first, as the underscore separates the words and makes them easier on my eyes. Forever team lazy.

Defining what type our columns should have
Output after cleaning up (the cleanup helper is applied to each partition with map_partitions)
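The screenshots above boil down to something like the sketch below: a dictionary of the dtypes we want, and a small helper that normalizes column names and casts types, applied to every partition with map_partitions. The exact dtype choices and the rate_code rename are assumptions based on the columns listed earlier.

```python
# A hedged sketch of the cleanup step; dtypes and renames are assumptions.
dtypes = {
    "passenger_count": "int32",
    "trip_distance": "float32",
    "rate_code": "int32",
    "fare_amount": "float32",
}

def clean(df_part, dtypes):
    # Normalize column names: strip whitespace and lowercase,
    # then unify RateCodeID -> rate_code across files.
    df_part = df_part.rename(
        columns={col: col.strip().lower() for col in df_part.columns}
    )
    df_part = df_part.rename(columns={"ratecodeid": "rate_code"})
    # Cast the columns we care about to consistent dtypes
    for col, dtype in dtypes.items():
        if col in df_part.columns:
            df_part[col] = df_part[col].astype(dtype)
    return df_part

# Apply the helper to each partition of the distributed DataFrame
taxi_df = taxi_df.map_partitions(clean, dtypes)
```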

Handling outliers

Outliers are always there, waiting to break something or to keep things from running efficiently, and we have to handle them. For example, fares less than $0 or greater than $500 (who would pay $500?). The same goes for passenger count: we discard records where the count is negative or greater than 6.

Handling outliers
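In code, the filters described above could look like the boolean mask below; the thresholds follow the prose and the column names are the ones listed earlier, continuing with the taxi_df from the previous sketch.

```python
# Keep only plausible trips: fares between $0 and $500,
# passenger counts that are non-negative and at most 6.
taxi_df = taxi_df[
    (taxi_df["fare_amount"] > 0)
    & (taxi_df["fare_amount"] < 500)
    & (taxi_df["passenger_count"] >= 0)
    & (taxi_df["passenger_count"] <= 6)
]
print(len(taxi_df))   # ~117 million rows remain after dropping ~7M outliers
```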

After cleaning up the dataset, we have discarded almost 7 million records and are left with 117 million from which we can actually gain insights.

Picking a training set

Let’s imagine you are going to make a trip to New York on the 25th (don’t actually go, just imagine), and you want to build a model that predicts what fare prices will be like in the last few days of the month, given the data from the first part of the month.

We are measuring the time so we know how long it takes the cluster to load the data from the storage bucket and run the ETL portion of the workflow. At this point we have 92 million rows for training and the remaining rows, roughly 25 million, for testing.
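A sketch of that split and timing is shown below, assuming pickup_datetime has already been parsed as a datetime column; the 25th is the cut-off, with everything before it used for training.

```python
# Split by day of month: before the 25th -> training, the 25th onwards -> testing.
import time

start = time.time()

taxi_df["day"] = taxi_df["pickup_datetime"].dt.day
train_df = taxi_df[taxi_df["day"] < 25].persist()
test_df = taxi_df[taxi_df["day"] >= 25].persist()

print(len(train_df), len(test_df))                 # roughly 92M vs 25M rows
print(f"ETL wall time: {time.time() - start:.1f}s")
```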

ETL time on CPU

The wall time represents the total time: 2 min 3 s, or 123 seconds, on the CPU. Impressive, huh? But what if I told you we can achieve even faster results?

ETL time on GPU

For the same process, just after enabling 1 GPU (an NVIDIA TITAN RTX with 24 GB of GDDR6 memory), we can finish the same job in 63 seconds. That is nearly a 2x speedup. If you think that is great, wait till I show you the speedup in training time. I was also monitoring the CPU usage while doing so; all cores working was soothing to my eyes, not going to lie.
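If you want to reproduce the single-GPU run, the Dask-CUDA cluster from the earlier sketch can be pinned to one device; which GPU you expose is of course up to you.

```python
# Expose only GPU 0 to the cluster to mimic the single-GPU measurement.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0")
client = Client(cluster)
```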