A Simple Way to Analyze Student Performance Data with Python

Photo by Stephen Dawson on Unsplash

Explore how to analyze data and build informative graphs in a productive way using Python and Dremio

Lucio Daza

Data analysis and data visualization are essential components of data science. Before the machine learning era, data science consisted largely of interpreting and visualizing data with different tools and drawing conclusions about its nature. These tasks have not gone away; they are now one of many jobs a data scientist does. Very often, exploratory data analysis (EDA) is a required part of the machine learning pipeline, because it gives a better understanding of the data: its distribution, cleanliness, features, and so on. Visualization is also recommended for presenting the results of machine learning work to stakeholders, who may not be familiar with sophisticated data science principles but can readily read graphs and charts. Data analysis and visualization can also be done as standalone tasks when there is no need to dig deeper into the data. In any case, a good data scientist should know how to analyze and visualize data.

In this tutorial, we will show how to analyze data and build clear, informative graphs. We will use two popular Python visualization libraries, matplotlib and seaborn, and Pandas for manipulating dataframes. The dataset we will work with is the Student Performance Data Set. We will demonstrate how to load the data into AWS S3 and then bring it into Python through Dremio. Dremio is also well suited to data curation and preprocessing, so we will do some of the data preparation directly in Dremio before handing the data over to Python.

This article assumes that you have access to Dremio and an AWS account. We will use Python 3.6 and the Pandas, Seaborn, and Matplotlib packages. To connect Dremio to Python, you also need Dremio’s ODBC driver. All Python code is written in a Jupyter Notebook environment.
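Before starting, it can help to confirm that the Python packages listed above are available in the notebook environment. The following is a minimal sketch, not part of the original setup: the helper name `missing_packages` is hypothetical, and `pyodbc` is an assumption about which Python package would be used to talk to the ODBC driver.

```python
import importlib.util

# Packages named in the prerequisites. "pyodbc" is an assumption: the article
# only says an ODBC driver is needed, not which Python package wraps it.
REQUIRED = ["pandas", "seaborn", "matplotlib", "pyodbc"]


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    # find_spec returns None when a top-level module is not installed,
    # and it does not import the module itself.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any package reported as missing can be installed with your usual package manager before you continue.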

There are two ways of loading data into AWS S3: via the AWS web console or programmatically. In this tutorial, we will show how to send data to S3 directly from Python code. To manage S3 from Python, we first need to create a user on whose behalf the code will act. To do this, select from list of services in the AWS console, click and then press the button: