
A Practical Tutorial on Data Manipulation with Pandas

Pandas is a powerful Python library for handling and analyzing data. It can work with large datasets (unlike Excel) and tackle complex data manipulation tasks. In the rest of this article, you will be introduced to the fundamental features of Pandas and see how they can be applied to the Kickstarter Projects dataset.

There are plenty of Pandas features that are not covered in this article. I would suggest referring to the Pandas documentation whenever you work with the library.

Importing the Library

Pandas does not come natively with Python, so we will first need to import the library into our notebook. Luckily, Colab has Pandas pre-installed in its environment, so we will not have to worry about the installation step.
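The import cell looks like this:

```python
# Import Pandas under its conventional alias
import pandas as pd
```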

pd is a common alias for Pandas.

Note that pd is an alias (i.e. a nickname) that we set for Pandas. From now on, whenever we want to use the library, we can simply call pd.

Reading Data

Pandas supports a variety of input data sources like csv files, Excel spreadsheets, SQL queries, etc. In this example, we will be importing the Kickstarter Projects dataset from a csv file.

We will first need to give the Colab notebook access to our Google Drive, which is where we store the csv file.
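A sketch of the mounting cell, using Colab's built-in google.colab module:

```python
# Mount Google Drive so the notebook can read files stored there
from google.colab import drive
drive.mount('/content/drive')
```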

Running the code above will generate a link which you may click on. Sign in with your Google Account, then copy the authorization code into the input box that appears in the Colab notebook after running the code.

To read the data from a csv file, we use the Pandas method pandas.read_csv().
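A sketch of the reading step; the file name and Drive path below are placeholders, so adjust them to match where you stored the csv file:

```python
# Read the Kickstarter Projects csv into a DataFrame
# (example path; point this to your own copy of the file)
data = pd.read_csv('/content/drive/My Drive/ks-projects.csv')
```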

The method .read_csv() takes the path to the file as an argument. You might need to change it according to where you placed your csv file. The data will be read and saved to a DataFrame (I am going to save it to a variable called data). DataFrames, along with Series, are the two primary data structures used in Pandas.

  • DataFrames are two-dimensional arrays. These are like your Excel tables with multiple columns and rows and hence, two-dimensional.
  • Series are the “columns” in a DataFrame. These objects are one-dimensional only and each can contain only one data type (e.g. strings, floats, datetime, etc.)

Viewing Data
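The first method to reach for is .info():

```python
# Print a concise summary of the DataFrame
data.info()
```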

Output of data.info()

The method .info() gives us a summary report of the data, including the number of rows, the number of non-null values, the data type of each Series, and the memory usage of the DataFrame. This is usually the first thing we do after importing data.

To display the data, we can simply write data (or however you named the variable) in a code cell and execute/run the cell. You can do this with the shortcut Shift+Enter.

Alternatively, we can show just some random samples with the method .sample(n), where n is the number of samples we want to display. For example, the code below prints out 10 random rows from the Kickstarter Projects dataset.
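```python
# Display 10 random rows
data.sample(10)
```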

Likewise, you can use .head(n) and .tail(n) to show the first and last n rows, respectively.
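For example:

```python
data.head(5)   # first 5 rows
data.tail(5)   # last 5 rows
```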

To select just one or more Series (columns), we can put the Series names in square brackets [] after the DataFrame data.
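For example, using Series that appear in this dataset:

```python
data['name']                # a single Series
data[['name', 'country']]   # two Series, returned as a DataFrame
```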

Cleaning Data

The data we work with is almost never fully prepped for direct analysis. Sometimes there might be missing or inaccurate values that mess up our calculations. In other instances, we will need to perform further processing and engineering to enable more meaningful analysis.

Handling Missing Data

Pandas has several built-in functions to handle missing data. For example, the method .isnull() marks the missing values with True. Combined with the aggregation method .sum(), this yields the total number of missing values in each Series.
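Putting the two together:

```python
# Count the missing values in each Series
data.isnull().sum()
```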

Output of data.isnull().sum()

In our example, we have 4 missing values in the Series name . We can further investigate what those values are by applying a filter.
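The filter looks like this:

```python
# Keep only the rows where name is missing
data[data['name'].isnull()]
```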

Output of data[data['name'].isnull()]

data['name'].isnull() returns a boolean mask marking the missing values in the name Series. It is used as an indexer when put inside the square brackets [] following the DataFrame data. NaN stands for Not a Number; it is a special numerical data type that represents unidentified or missing values.

Now that we have identified the missing values, the next step is to decide how to handle them. The easiest way is to remove the rows containing missing data. However, this approach should generally be avoided, as it reduces the total amount of data we have and can introduce bias into our analysis. Such bias can be very significant if the missing data accounts for a large proportion of the dataset.

In our case, name is not a very helpful Series for our analysis, so we can safely replace its missing values with some arbitrary value or simply drop the Series from the DataFrame. Let's start with the first method.
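A minimal sketch; the fill value 'Unknown' here is an arbitrary choice:

```python
# Replace missing names with a placeholder value
data['name'].fillna('Unknown', inplace=True)
```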

data['name'] selects the Series name. The method .fillna() replaces the missing data in the selected Series with a specified value. The argument inplace = True tells Pandas to make the change directly to the original DataFrame data. Leaving the argument at its default, inplace = False, would instead return a new Series with the missing values filled in, without modifying the actual DataFrame data.

Dropping Series

Sometimes we might want to exclude some unnecessary Series. One reason for this is that it may improve performance, especially when we work with large datasets. However, careful consideration is needed when deciding which Series to drop, as you might erase useful information.
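A sketch of dropping the name Series:

```python
# Remove the name Series from the DataFrame
data.drop(columns=['name'], inplace=True)
```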

We also include the argument inplace = True to directly modify data.

Handling Errors

Having errors in our data isn't rare; they might occur for a number of reasons, from human mistakes to system failures. Regardless, we will need to handle them appropriately so that our analysis is more accurate. Identifying and fixing those errors is often quite challenging, however, since it requires not only technical skills but also a deep understanding of the dataset.

Numerical Data: Errors in numerical data require various statistical techniques that are probably too overwhelming for this introductory article.

Categorical Data: The quickest way to detect errors in categorical data is to print out all unique values.
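For the country Series:

```python
# List every unique value in the country Series
data['country'].unique()
```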

Output of data['country'].unique()

A slightly better way is to use the Pandas method .value_counts(), which prints out the unique values AND their frequency.
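```python
# Unique values and how often each occurs
data['country'].value_counts()
```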

Output of data['country'].value_counts()

This Series tells us where a project on Kickstarter was from. There is a strange value N,0", which seems to be an error. Let's see what those rows are.
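Filtering on that value:

```python
# Inspect 10 random rows that carry the suspicious country value
data[data['country'] == 'N,0"'].sample(10)
```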

Output of data[data['country'] == 'N,0"'].sample(10)

These values are errors from data collection (undefined state and 0 backers), and since they account for only ~1% of the whole dataset, we can just drop them.
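One way to drop them is to keep only the valid rows; a minimal sketch:

```python
# Keep only the rows with a valid country value
data = data[data['country'] != 'N,0"']
```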

Working with Datetime Data

Datetime data often require different types of treatment. For example, we might want to calculate the duration between two dates or extract a time part (month, year, hour, minute, etc.). Pandas makes these tasks simple with built-in datetime functions.

First, we will need to convert the datetime Series to their correct data types.
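Using pd.to_datetime() on the two Series:

```python
# Convert the deadline and launched Series from strings to datetime
data['deadline'] = pd.to_datetime(data['deadline'])
data['launched'] = pd.to_datetime(data['launched'])
```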

Output of data.info() after converting the deadline and launched Series to datetime

To extract any datetime component, we use the .dt accessor.
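For example, to pull out the launch year (the new Series name launched_year is my choice here):

```python
# Create a new Series holding the year each project was launched
data['launched_year'] = data['launched'].dt.year
```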

The code above creates a new Series in our DataFrame holding the year in which each project was launched. With that, we can easily identify an error in this Series.
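One quick check is the minimum launch year:

```python
# Find the earliest launch year in the dataset
data['launched_year'].min()
```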

This code returns 1970, which doesn't make any sense, as Kickstarter was founded in 2009! This is obviously an error, so we can exclude any rows with a launch year before 2009 from our DataFrame.
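A minimal sketch of the filter:

```python
# Keep only the rows with a plausible launch year
data = data[data['launched_year'] >= 2009]
```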

To calculate the duration of the projects, we can just take the difference between the deadline and launched Series.
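For example (the new Series name duration is my choice):

```python
# Element-wise subtraction yields a timedelta Series
data['duration'] = data['deadline'] - data['launched']
```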

This performs element-wise subtraction between the two datetime Series and returns a timedelta Series, which we can convert to a time unit such as days or months.

We will need the help of NumPy here. NumPy is another very powerful Python library that adds support for arrays and matrices. Pandas is built on top of NumPy (remember, a DataFrame is a two-dimensional array, or matrix?), so the two libraries work well together.
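A minimal sketch of the conversion, reusing the duration Series created above:

```python
import numpy as np

# Divide the timedelta Series by one day to get the duration as a float
data['duration_days'] = data['duration'] / np.timedelta64(1, 'D')
```

To convert the timedelta to months, change 'D' to 'M', or to 'h' if you want hours instead.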