6 Lesser-Known Yet Awesome Tricks in Pandas

Tricks I wish I knew sooner to get more value out of Pandas

Yi Li

As the most popular Python library for analytics, Pandas is a big project that offers various data manipulation and processing capabilities. It is probably no exaggeration to say that data scientists, myself included, use Pandas on a day-to-day basis in our work.

This post is Part 1 of a mini-series dedicated to sharing my top 10 lesser-known yet favorite features in Pandas. Hopefully, you can walk away with some inspiration to make your own code more robust and efficient.

The dataset for this mini-series is from the Table of food nutrients, a Wikipedia page containing 16 tabular lists for basic foods categorized by food types, and their nutrients. For this demonstration, specifically, we will work with a subset of the Dairy products table, as shown below,

Source: Wikipedia

1. Scraping tables from HTML with read_html(match)

When it comes to web scraping in Python, my go-to library used to be BeautifulSoup until I discovered read_html() in Pandas. Without the hassle of parsing the HTML page ourselves, we can directly extract the data stored in HTML tables.
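A minimal sketch of what such a call might look like. To keep it runnable offline, it parses a small inline HTML snippet rather than the Wikipedia page. The snippet, its numbers, and the helper name tables_matching are illustrative assumptions, and read_html() needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a page with several tables; the values are
# made up for illustration and are not the real Wikipedia figures.
SAMPLE_HTML = """
<table>
  <tr><th>Food</th><th>Calories</th></tr>
  <tr><td>Fortified milk</td><td>100</td></tr>
  <tr><td>Yogurt</td><td>50</td></tr>
</table>
<table>
  <tr><th>Food</th><th>Calories</th></tr>
  <tr><td>Apple</td><td>10</td></tr>
</table>
"""


def tables_matching(source, pattern):
    """Return the tables in `source` whose text matches `pattern`.

    `source` can be a URL, a file path, or a file-like object, and
    `pattern` can be a plain string or a regular expression.
    read_html() always returns a list of DataFrames.
    """
    return pd.read_html(source, match=pattern)
```

Calling tables_matching(StringIO(SAMPLE_HTML), "Fortified milk") should return a list holding only the first table. Passing the page URL instead of the StringIO object would run the same filter against the live page.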

Notice the argument match='Fortified milk' in the code? It selects only the table(s) containing the specified string or regular expression, which, in our case, is the dairy table. The match argument becomes extremely handy when the HTML page holds many tables.

Looking at the output display, however, we realize that quite a few rows and columns are truncated!

2. Loading customized options automatically

If you have worked with Pandas for a while, you probably know that it lets users configure display-related settings through its options system. For instance, setting pd.options.display.max_rows = 1000 solves the display issue above, which literally made my day back when I first learned Pandas!
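As a small sketch of the options system, the same setting can be reached through attribute access, set_option(), or temporarily through option_context(). The extra option and the values below are picked for illustration only.

```python
import pandas as pd

# Attribute-style and function-style access go through the same options system.
pd.options.display.max_rows = 1000
pd.set_option("display.max_columns", None)  # None means "no limit"

# option_context applies a setting only inside the with-block.
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")

# Once the block exits, the earlier value is restored.
outside = pd.get_option("display.max_rows")
```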

However, I grew frustrated at having to retype the same configuration options every single time I started a new IPython session. Little did I know that these options can be collected in a startup file.
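One possible shape for such a startup file is sketched below. The option values and the helper name apply_display_options are assumptions chosen for illustration, not the exact settings used in this post.

```python
# Hypothetical startup file; save it anywhere and name it as you like.
import pandas as pd

# Display options chosen for illustration; adjust to taste.
DISPLAY_OPTIONS = {
    "max_rows": 1000,
    "max_columns": None,         # show every column
    "max_colwidth": 50,
    "precision": 4,
    "expand_frame_repr": False,  # do not wrap wide frames across lines
}


def apply_display_options(options=DISPLAY_OPTIONS):
    """Set each display option and print the value that was applied."""
    for name, value in options.items():
        key = f"display.{name}"
        pd.set_option(key, value)
        print(f"{key} = {pd.get_option(key)}")


apply_display_options()
```

Keeping the options in a dictionary makes the file easy to extend, and printing each value gives a quick confirmation at the start of a session that the settings took effect.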

Then we simply set the PYTHONSTARTUP environment variable to point at this startup file, and all the convenient settings will be loaded automatically whenever an interactive Python session starts. Here is the printout after the startup file is executed,

3. Switching from .iterrows() to .itertuples()

As we just saw, this data contains missing values due to the sub-categories in the ‘Food’ column,

Source: Wikipedia

Intuitively, our next step is to deal with the missing values, either by dropping the rows with NAs or by filling them with the desired values. Whichever option we choose, we first need to concatenate the entries in the ‘Food’ column, since each of them carries unique information. This, as we all know, can be achieved by looping through the rows with the .iterrows() method.
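The row loop described above can be sketched as follows, first with .iterrows() and then with .itertuples(), which is generally the faster of the two because it yields lightweight namedtuples instead of building a Series per row. The miniature table, its values, and the rule that a row without nutrient values is a category header are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature of the dairy table (made-up values): a row with no
# nutrient values is treated as a category header, and the rows below it
# are its sub-categories.
df = pd.DataFrame(
    {
        "Food": ["Cows' milk", "whole", "skim", "Cheese", "cheddar"],
        "Calories": [None, 660.0, 360.0, None, 100.0],
    }
)


def full_names_iterrows(frame):
    """Build full food names by looping with .iterrows()."""
    names, category = [], ""
    for _, row in frame.iterrows():  # each row is a Series
        if pd.isna(row["Calories"]):
            category = row["Food"]   # header row: remember the category
            names.append(category)
        else:
            names.append(f"{category} {row['Food']}")
    return names


def full_names_itertuples(frame):
    """Same logic with .itertuples(); fields are accessed as attributes."""
    names, category = [], ""
    for row in frame.itertuples(index=False):
        if pd.isna(row.Calories):
            category = row.Food
            names.append(category)
        else:
            names.append(f"{category} {row.Food}")
    return names


names_a = full_names_iterrows(df)
names_b = full_names_itertuples(df)
```

Both functions produce the same list of names; only the iteration method differs, so swapping one for the other is a small change.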