Explore how to analyze data and build informative graphs in a productive way using Python and Dremio
Data analysis and data visualization are essential components of data science. In fact, before the machine learning era, data science was largely about interpreting and visualizing data with different tools and drawing conclusions about its nature. Nowadays, these tasks are still present; they have simply become part of the wider range of data science jobs. Very often, so-called EDA (exploratory data analysis) is a required part of the machine learning pipeline. It allows a better understanding of the data: its distribution, purity, features, and so on. Visualization is also recommended for presenting the results of machine learning work to different stakeholders, who may not be familiar with sophisticated data science principles but find it convenient to look at graphs and charts. Besides, data analysis and visualization can be done as standalone tasks if there is no need to dig deeper into the data. In any case, a good data scientist should know how to analyze and visualize data.
In this tutorial, we will show how to analyze data and how to build nice and informative graphs. We will use popular Python libraries for visualization, namely Matplotlib and Seaborn, as well as Pandas as a tool for manipulating dataframes. The dataset we will work with is the Student Performance Data Set. We will demonstrate how to load data into AWS S3 and then bring it into Python through Dremio. Dremio is also a great tool for data curation and preprocessing, so we will do some work on the data directly in Dremio before putting it into Python's hands.
This article assumes that you have access to Dremio and also have an AWS account. We will use Python 3.6 and the Pandas, Seaborn, and Matplotlib packages. To connect Dremio to Python, you also need Dremio's ODBC driver. All Python code is written in the Jupyter Notebook environment.
There are two ways of loading data into AWS S3: via the AWS web console or programmatically. In this tutorial, we will show how to send data to S3 directly from Python code. To manage S3 from Python, we need to create a user on whose behalf the code will make requests. To do this, select IAM from the list of services in the AWS console, click Users, and then press the Add user button:
Give a name to the new user (in our case, we have chosen test_user) and enable programmatic access for this user:
In the next step, you have to set permissions. In most cases, this is an important stage where you can fine-tune permissions for different users; you can even create your own access policy here. But for simplicity, in this tutorial just grant the user full access to AWS S3:
After the user is created, you should copy the needed credentials (access key ID and secret access key). You will use them in the code later to make requests to AWS S3.
In the AWS config file (usually ~/.aws/config), set the region in which you want to create buckets and other S3 resources:
Now everything is ready for coding! Let’s do something simple first. For example, show the existing buckets in S3:
In the code above, we import the boto3 library and then create the client object. As a parameter, we specify "s3" to show that we want to work with this AWS service. After that, we use the list_buckets() method of the created object to check the available buckets. Here is what we got in the response variable (an empty list of buckets):
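A minimal sketch of this step (assuming boto3 is installed and the credentials of the user created above are configured, for example in ~/.aws/credentials or environment variables):

```python
import boto3

# Create a client for the S3 service; boto3 picks up the access key ID
# and secret access key of the user we created earlier
client = boto3.client("s3")

# Ask S3 for the list of buckets available to this user
response = client.list_buckets()
print(response["Buckets"])  # a list with one dict per bucket (empty at first)
```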
Let’s now create a bucket. To do this, use the create_bucket() method of the client object:
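A sketch of this call; the bucket name below is hypothetical (bucket names must be globally unique), and outside of us-east-1 you also need to pass the region as a LocationConstraint:

```python
import boto3

client = boto3.client("s3")

# Create a new bucket; the name is a hypothetical example,
# and the region should match the one from your AWS config file
client.create_bucket(
    Bucket="dremio-student-performance",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)
```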
Here is the output of the list_buckets() method after the creation of the bucket:
You can also see the created bucket in AWS web console:
We have two files that we need to load into Amazon S3, student-por.csv and student-mat.csv. To load these files, we use the upload_file() method of the client object:
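The upload itself may look like the following sketch (the bucket name is again hypothetical):

```python
import boto3

client = boto3.client("s3")

# Upload both CSV files into the bucket created earlier
for filename in ("student-por.csv", "student-mat.csv"):
    client.upload_file(
        Filename=filename,                    # local path to the file
        Bucket="dremio-student-performance",  # hypothetical bucket name
        Key=filename,                         # object name inside the bucket
    )
```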
In the end, you should be able to see those files in the AWS web console (in the bucket created earlier):
To connect Dremio and AWS S3, first go to the IAM section in the services list, open the Delete your root access keys section, and press the Manage Security Credentials button. Then select the Access keys tab and click on the Create New Access Key button. Copy the AWS Access Key and AWS Access Secret after pressing the Show Access Key toggle:
In the Dremio GUI, click on the button to add a new source, then choose Amazon S3. The following window should appear:
In the window above, you should specify the name of the source (student_performance) and the credentials that you generated in the previous step.
In this part of the tutorial, we will show how to deal with the dataframe about students' performance in their Portuguese classes. However, the same actions are needed to curate the other dataframe (about performance in Mathematics classes).
First, open the student-por.csv file in the student_performance source. Be sure to change the field delimiter to ";", the line delimiter to "\n", and check the Extract Field Names checkbox, as shown in the image below:
We don't need the G1 and G2 columns, so let's drop them. Click on the arrow near the name of each column to evoke the context menu, then select the option to drop the column:
Through the same drop-down menu, we can rename the G3 column to final_target:
Next, we notice that all our numeric values are of the string data type. We want to convert them to integers. To do this, click on the little Abc button near the name of the column, then select the needed data type:
The following window will appear as a result:
In this window, we need to specify the name of the new column (the column with the new data type) and set some other parameters. When ready, press the button. We should do this type conversion for all numeric columns stored as strings: age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout, Dalc, Walc, health, and absences.
After performing all the above operations with the data, we save the dataframe in the student_performance_space with the name port1.
In Dremio, everything you do is reflected in SQL code. For example, all the actions described above generated the following SQL code (you can check it by clicking on the SQL Editor button):
Moreover, you can write your own SQL queries. Let's say we want to create a new binary column famsize_bin_int. It should contain "1" when the value in the famsize column of a given row equals "GT3", and "0" when it equals "LE3". Here is the SQL code for implementing this idea:
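A query in this spirit (a sketch, not necessarily the exact code from the SQL Editor; it assumes the intermediate dataset was saved as port1, as described above) can use a CASE expression:

```sql
SELECT *,
       CASE WHEN famsize = 'GT3' THEN 1 ELSE 0 END AS famsize_bin_int
FROM port1
```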
In the following image, you can see that the famsize_bin_int column appears in the dataframe after clicking on the button:
Finally, we want to sort the values in the dataframe based on the final_target column. We want to see the students with the lowest grades at the top of the table, so we choose the Sort Ascending option from the drop-down menu:
In the end, we save the curated dataframe under the port_final name in the student_performance_space. The same is true for the mathematics dataset (we saved it as mat_final table).
To connect Dremio and a Python script, we need the PyODBC package. The code below imports the port_final and mat_final tables into Python as Pandas dataframes.
As you can see, we need to specify the host, port, Dremio credentials, and the path to Dremio's ODBC driver. Then we use PyODBC's connect() method to establish a connection. Pandas has a read_sql() method to fetch data from remote sources. When creating the SQL queries, we used the full paths to the tables (name_of_the_space.name_of_the_dataframe).
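The connection code may look roughly like the following sketch; the host, port, credentials, and driver path here are placeholders that you need to replace with your own values:

```python
import pyodbc
import pandas as pd

# All values below are placeholders - substitute your own
host = "localhost"
port = 31010
uid = "dremio_user"
pwd = "dremio_password"
driver = "/opt/dremio-odbc/lib64/libdrillodbc_sb64.so"  # path to Dremio's ODBC driver

cnxn = pyodbc.connect(
    f"Driver={driver};ConnectionType=Direct;HOST={host};PORT={port};"
    f"AuthenticationType=Plain;UID={uid};PWD={pwd}",
    autocommit=True,
)

# Full path to a table: name_of_the_space.name_of_the_dataframe
df_port = pd.read_sql("SELECT * FROM student_performance_space.port_final", cnxn)
df_mat = pd.read_sql("SELECT * FROM student_performance_space.mat_final", cnxn)
```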
The main goal of exploratory data analysis is to understand the data. It can be required as a standalone task, as well as a preparatory step in the machine learning process. EDA helps to figure out which features your data has, how they are distributed, whether there is a need for data cleaning and preprocessing, and so on.
Probably every EDA starts with exploring the shape of the dataset and taking a glance at the data. To show the first 5 records of a dataframe, you can call the head() method on it. You can also specify the number of rows as a parameter of this method; in other words, five is the default number of rows displayed, but you can change this to 10, for example.
To check the shape of the data, use the shape attribute of the dataframe:
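With a tiny hypothetical dataframe standing in for df_port (the real one comes from Dremio), these two steps look like this:

```python
import pandas as pd

# Small hypothetical stand-in for the dataframe loaded from Dremio
df_port = pd.DataFrame({
    "sex": ["F", "M", "F", "F", "M", "F"],
    "address": ["U", "U", "R", "U", "R", "U"],
    "age": [15, 16, 15, 17, 16, 15],
    "final_target": [11, 9, 14, 10, 12, 13],
})

print(df_port.head())   # first 5 rows (the default)
print(df_port.head(3))  # first 3 rows
print(df_port.shape)    # (number_of_rows, number_of_columns)
```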
You can see that there are far more rows in the Portuguese dataframe than in the Mathematics one. Further in this tutorial, we will work only with the Portuguese dataframe in order not to overload the text. But these dataframes have an identical structure, and if you want, you can perform the same operations on the Mathematics dataframe and compare the results.
Now we want to look only at the students who are from an urban district. To do this, we extract only those rows which contain the value "U" in the address column:
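With a hypothetical stand-in dataframe, the filtering itself is a one-liner with a boolean mask:

```python
import pandas as pd

# Hypothetical stand-in for the df_port dataframe loaded from Dremio
df_port = pd.DataFrame({
    "address": ["U", "U", "R", "U", "R", "U"],
    "final_target": [11, 9, 14, 10, 12, 13],
})

# Boolean mask: keep only rows where address equals "U"
urban = df_port[df_port["address"] == "U"]
print(urban)
print(f"{len(urban)} of {len(df_port)} students are from urban areas")
```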
From the output above, we can say that there are more students from urban areas than from rural areas. If we continue to work on the machine learning model further, we may find this information useful for some feature engineering, for example.
Besides the head() method, there are two other Pandas methods that allow looking at a subsample of the dataframe. The tail() method returns rows from the end of the table. The sample() method returns N random rows from the dataframe. It can be helpful if you want to look not only at the beginning or end of the table, but also display rows from different parts of the dataframe:
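Both methods in action, on a hypothetical stand-in dataframe:

```python
import pandas as pd

# Hypothetical stand-in for the df_port dataframe loaded from Dremio
df_port = pd.DataFrame({
    "address": ["U", "U", "R", "U", "R", "U"],
    "final_target": [11, 9, 14, 10, 12, 13],
})

print(df_port.tail(2))    # last 2 rows
print(df_port.sample(3))  # 3 random rows from anywhere in the table

# pass random_state for a reproducible "random" sample
print(df_port.sample(3, random_state=42))
```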
To inspect what columns your dataframe has, you may use the columns attribute:
If you need to write code that does something with the column names, you can do this easily by converting them to a native Python list.
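For example (again on a hypothetical stand-in dataframe):

```python
import pandas as pd

df_port = pd.DataFrame({
    "sex": ["F", "M"],
    "age": [15, 16],
    "final_target": [11, 9],
})

print(df_port.columns)  # an Index object listing the column names

# Convert to a plain Python list to work with the names in code
cols = list(df_port.columns)
for name in cols:
    print(name.upper())
```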
An important step in any EDA is to check whether the dataframe contains null values. With Pandas, this can be done without any sophisticated code. Just call the isnull() method on the dataframe and then aggregate the values using the sum() method:
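A small sketch with a hypothetical stand-in dataframe, one missing value added on purpose so the output is non-trivial:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in; one NaN inserted deliberately
df_port = pd.DataFrame({
    "age": [15, 16, np.nan, 17],
    "final_target": [11, 9, 14, 10],
})

# isnull() marks each cell True/False; sum() counts the nulls per column
print(df_port.isnull().sum())
```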
As we can see, our dataframe is already well preprocessed and contains no missing values.
Similarly, you may want to look at the data types of different columns. Types of data are accessible via the dtypes attribute of the dataframe:
All columns in our dataset are either numerical (integers) or categorical (object).
It is often useful to know basic statistics about the dataset. In Pandas, you can do this by calling the describe() method:
This method returns statistics (count, mean, standard deviation, min, max, etc.) about each numerical column of the dataframe. To see information about categorical features, you should set the include parameter of the describe() method to [‘O’] (see the image below). For categorical columns, the method returns only the count, the number of unique values, the most frequent value, and its frequency. Most of our categorical columns are binary:
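Both calls, sketched on a hypothetical stand-in dataframe:

```python
import pandas as pd

df_port = pd.DataFrame({
    "sex": ["F", "M", "F", "F"],
    "age": [15, 16, 15, 17],
    "final_target": [11, 9, 14, 10],
})

# Numeric columns: count, mean, std, min, quartiles, max
print(df_port.describe())

# Object (categorical) columns: count, unique, top, freq
print(df_port.describe(include=["O"]))
```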
Now we are going to build visualizations with Matplotlib and Seaborn. But first, we need to import these packages:
```python
import matplotlib.pyplot as plt
import seaborn as sns
```
Let's see the ratio between males and females in our dataset. To do this, we select the sex column, then call the value_counts() method with the normalize parameter set to True, and then call the plot() method, which uses Matplotlib to build a graph. We can see that there are more girls (roughly 60%) in the dataset than boys (roughly 40%). You can find the corresponding code and visualization below.
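A sketch of this chain of calls (the headless Agg backend is used here only so the script also runs without a display):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical stand-in for the df_port dataframe loaded from Dremio
df_port = pd.DataFrame({"sex": ["F", "M", "F", "F", "M", "F"]})

# normalize=True returns ratios instead of raw counts
ratios = df_port["sex"].value_counts(normalize=True)
print(ratios)

# plot() delegates to Matplotlib under the hood
ax = ratios.plot(kind="bar")
ax.set_ylabel("share of students")
```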
We have seen the distribution of the sex feature in our dataset. But often, the most interesting column is the target column. In our case, this column is called final_target (it represents the final grade of a student). This time we will use Seaborn to make a graph: the Seaborn package has the distplot() method for this purpose. The parameters we have specified are the color (green) and the number of bins (10). The code and image are below:
From the histogram above, we can say that the most frequent grade is around 10-12, but there is a tail on the left side (near zero). Maybe in the future, before building a model, it would be worth transforming the distribution of the target variable to make it closer to a normal distribution.
The Seaborn package has many convenient functions for comparing graphs. One of them is pairplot(). You select which columns you want to analyze, and Seaborn builds the distribution of each column on the diagonal and scatter plots everywhere else. Each scatter plot shows the interrelation between two of the specified columns. In our case, this visualization may not be as useful as it could be: it works better for continuous features than for integers. However, it conveys the gist of this type of visualization:
Let's look at the distributions of all numeric columns in our dataset using Matplotlib. First, we create a dataframe with only the numeric columns (df_num). To do this, we use the select_dtypes() Pandas method. We specify that we want to take only the float64 and int64 data types, though for this dataset it is enough to take only integer columns (there are no float values). We also drop the famsize_bin_int column since it was not numeric originally.
Now, we use the hist() method on the df_num dataframe to build a graph:
In the parameters of the hist() method, we have specified the size of the plot, the size of labels, and the number of bins.
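The two steps together can be sketched like this (the article selects float64/int64 explicitly; "number" is a convenient shorthand that covers both):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical stand-in for the df_port dataframe loaded from Dremio
df_port = pd.DataFrame({
    "sex": ["F", "M", "F", "F", "M", "F"],
    "age": [15, 16, 15, 17, 16, 15],
    "final_target": [11, 9, 14, 10, 12, 13],
})

# Keep only numeric columns; string columns like sex are dropped
df_num = df_port.select_dtypes(include=["number"])

# One histogram per numeric column
axes = df_num.hist(figsize=(8, 6), bins=10)
```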
It is often interesting to analyze the range of values of a column under certain conditions. Here is how this works: we want to see how the range of the final_target column varies depending on the jobs of students' mothers and fathers. We use Seaborn's boxplot() function for this, after tuning the size of the plot with Matplotlib. The graph for fathers' jobs is shown below:
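A sketch of such a boxplot, grouping a hypothetical stand-in for the Fjob column against final_target:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in with a father's job column
df_port = pd.DataFrame({
    "Fjob": ["teacher", "services", "teacher", "at_home", "other", "services"],
    "final_target": [14, 9, 15, 10, 12, 11],
})

plt.figure(figsize=(10, 4))  # tune the size of the plot with Matplotlib
ax = sns.boxplot(x="Fjob", y="final_target", data=df_port)
```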
And here is the graph for mothers’ jobs:
The boxplot shows the median and the lower and upper quartiles of the data. The whiskers show the rest of the distribution, and points outside the whiskers represent outliers.
One interesting insight you can derive from the graphs above is that if the father or mother of a student is a teacher, it is more probable that the student will get a high final grade.
In the same way, we can see that girls are more successful in their studies than boys:
One of the most interesting things about EDA is the exploration of the correlation between variables. We can compute the correlations and then visualize them using Seaborn. Here we will look only at numeric columns. When doing real preparation for machine learning model training, a data scientist should encode categorical variables and work with them as with numeric columns, but this is beyond the scope of our tutorial.
The first row of the code below uses the corr() method to calculate the correlations between the different columns and the final_target feature. We drop the last record because it is the final_target itself (we are not interested in the fact that final_target has a perfect correlation with itself).
The second row of the code filters out all weak correlations. In our case, we want to look only at the correlations which are greater than 0.12 (in absolute value), sorted in descending order.
The third row simply prints out the results.
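The three rows described above can be sketched as follows (with a small hypothetical numeric dataframe; the exact filtering code in the article may differ slightly):

```python
import pandas as pd

# Hypothetical stand-in for the numeric part of the students dataframe
df_num = pd.DataFrame({
    "age": [15, 16, 15, 17, 16, 15],
    "studytime": [2, 1, 3, 1, 2, 3],
    "failures": [0, 2, 0, 1, 0, 0],
    "final_target": [11, 9, 14, 8, 12, 13],
})

# 1) correlation of every numeric column with final_target,
#    dropping final_target's perfect correlation with itself
correlations = df_num.corr()["final_target"].drop("final_target")

# 2) keep only correlations stronger than 0.12 in absolute value,
#    sorted in descending order
strong = correlations[correlations.abs() > 0.12].sort_values(ascending=False)

# 3) print the result
print(strong)
```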
We can see that there are 8 features that noticeably correlate with the target variable. Some have a positive correlation, while others have a negative one. For example, the strongest negative correlation is with the failures feature. This is reasonable: if a student has had bad marks in the past, he or she may continue to study poorly in the future as well.
Also, the more alcohol a student drinks on weekends or workdays, the lower the final grade he or she gets. There are also negative correlations with the freetime and traveltime variables: the more free time a student has, the lower the performance he or she demonstrates, and students who travel more also get lower grades.
At the same time, we have 3 variables positively correlated with the target: studytime, Medu, and Fedu. It is not surprising that the more time you spend on your studies, the better your performance. The interesting fact is that the parents' education also correlates noticeably with the performance of their children.
On the heatmap, you can see the correlations not only with the target variable, but also between the variables themselves. For example, there is a strong correlation between the father's and mother's education, between the amount of time a student goes out and their alcohol consumption, between the number of failures and the age of the student, etc.
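Such a heatmap can be sketched like this (with a hypothetical numeric dataframe standing in for the real one):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for the numeric part of the students dataframe
df_num = pd.DataFrame({
    "age": [15, 16, 15, 17, 16, 15],
    "studytime": [2, 1, 3, 1, 2, 3],
    "failures": [0, 2, 0, 1, 0, 0],
    "final_target": [11, 9, 14, 8, 12, 13],
})

plt.figure(figsize=(6, 5))
# annot=True writes the correlation coefficient inside each cell
ax = sns.heatmap(df_num.corr(), annot=True, cmap="coolwarm")
```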
The exploration of correlations is one of the most important steps in EDA. It helps you understand which features may be useful, which are redundant, and which new features can be created artificially.
In this article, we walked through how to load data into AWS S3 programmatically, how to prepare data stored in AWS S3 using Dremio, and how to analyze and visualize that data in Python. We also showed how to connect Dremio to your data lake and to Python code. Along the way, we used the Matplotlib and Seaborn packages.