Tricks I wish I knew sooner to get more value out of Pandas
As the most popular Python library for analytics, Pandas is a big project that offers various data manipulation and processing capabilities. It is probably no exaggeration to say that data scientists, myself included, use Pandas on a day-to-day basis in our work.
This blog is Part 1 of a mini-series dedicated to sharing my top 10 lesser-known yet favorite features in Pandas. Hopefully, you can walk away with some inspiration to make your own code more robust and efficient.
The dataset for this mini-series is from the Table of food nutrients, a Wikipedia page containing 16 tabular lists for basic foods categorized by food types, and their nutrients. For this demonstration, specifically, we will work with a subset of the Dairy products table, as shown below,
1. Scraping tables from HTML with read_html()
When it comes to web scraping in Python, my go-to library used to be BeautifulSoup, until I discovered read_html() in Pandas. Without the hassle of parsing the HTML page ourselves, we can directly extract the data stored as HTML tables,
Notice the argument match='Fortified milk' in the code? It tells Pandas to select only the table(s) containing the specified string or regular expression, which, in our case, is the dairy table. The match argument becomes extremely handy when the HTML page grows big.
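Here is a minimal, self-contained sketch of the idea. The inline HTML and its column names are stand-ins for the real Wikipedia page, which you would fetch by passing its URL instead:

```python
import pandas as pd
from io import StringIO

# Stand-in for the Wikipedia page: two tables, only the second mentions "Fortified milk"
html = """
<table>
  <tr><th>Fruit</th><th>Grams</th></tr>
  <tr><td>Apple</td><td>150</td></tr>
</table>
<table>
  <tr><th>Food</th><th>Grams</th></tr>
  <tr><td>Fortified milk</td><td>976</td></tr>
</table>
"""

# match keeps only the table(s) whose text matches the given string/regex
tables = pd.read_html(StringIO(html), match="Fortified milk")
dairy = tables[0]   # the fruit table was filtered out
```

Against the real page, the call would look like pd.read_html("https://en.wikipedia.org/wiki/Table_of_food_nutrients", match="Fortified milk") — same signature, just a URL instead of an HTML string.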
Looking at the output display, however, we realize that quite a few rows and columns are truncated!
2. Loading customized options at Python startup
Working with Pandas for a while, you probably know that it allows users to configure display-related options in its options system. For instance, setting
pd.options.display.max_rows = 1000 solves the display issue above, which literally made my day back when I first learned Pandas!
However, my frustration piled up over the fact that I had to re-write the same configuration options every single time I started a new IPython session. Little did I know that these options can be compiled into a startup file,
Then we simply set the PYTHONSTARTUP environment variable to point at this startup file, and all the convenient settings are loaded automatically when Python starts. Here is the printout after the startup file has been executed,
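A minimal startup file might look like the following sketch; the file name and the extra max_columns setting are my own choices, not the article's:

```python
# startup.py -- point the PYTHONSTARTUP environment variable at this file
import pandas as pd

# Display options loaded automatically in every interactive session
pd.options.display.max_rows = 1000      # fixes the truncated-rows issue above
pd.options.display.max_columns = 100    # assumed extra: widen the column display too
```

With the file saved somewhere stable, a line such as `export PYTHONSTARTUP="$HOME/startup.py"` in your shell profile makes every interactive Python session pick it up.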
3. Switching from .iterrows() to .itertuples()
As we just saw, this data contains missing values due to the sub-categories in the ‘Food’ column,
Intuitively, our next step is to remove the missing values by either eliminating the rows with NAs or populating them with the desired values. Regardless of which option we go with, we first need to concatenate the 'Food' column, as the rows each contain unique information. This, as we all know, can be achieved by looping through rows with the .iterrows() method.
However, a better way of looping in Pandas is .itertuples(), which, as the name suggests, returns each row as a tuple rather than the pd.Series that .iterrows() yields. More often than not, I find .itertuples() runs way faster than .iterrows(), because the former eliminates much of the overhead involved in calling .iterrows(), e.g., data-type checking. More details can be found here if you are interested.
Now, let's explore how to concatenate our 'Food' values with .itertuples(),
Tricks in this syntax include (1) initializing two global variables to store the current 'Food' value and its row index, which happens to be the 1st element of the row tuple; (2) modifying the original dairy table by assigning the new concatenated 'Food' value back.
Here is the output we envisioned,
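The loop described above can be sketched as follows. This is a minimal reconstruction under my own assumptions — the toy rows, the NA-in-Measure test for a continuation row, and the final drop of the leftover row are all stand-ins for the article's hidden code:

```python
import pandas as pd

# Toy stand-in for the dairy table: the first row is a two-line entry
# ("Fortified milk" / "(6 qt.)") whose nutrient columns are missing
dairy = pd.DataFrame({
    "Food": ["Fortified milk", "(6 qt.)", "Milk, whole"],
    "Measure": [None, "1 qt.", "1 qt."],
    "Grams": [None, 976.0, 976.0],
})

cur_food, cur_idx = "", -1              # globals: pending 'Food' text and its row index
drop_idx = []
for row in dairy.itertuples():          # row.Index is the 1st element of the tuple
    if pd.isna(row.Measure):            # a continuation row carrying only 'Food' text
        cur_food, cur_idx = row.Food, row.Index
    elif cur_food:                      # the following data row: write the joined value back
        dairy.loc[row.Index, "Food"] = f"{cur_food} {row.Food}"
        drop_idx.append(cur_idx)        # the text-only row is no longer needed
        cur_food = ""

dairy = dairy.drop(index=drop_idx).reset_index(drop=True)
```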
4. Imputing missing values with fillna()
As we briefly touched upon, the other way of handling missing values is to populate them with desired values, which, in our case, would be values in the next consecutive row. Now the
fillna() function comes in handy,
With just one line of code, our missing values are imputed,
Here we specify the argument method='bfill', i.e., backward fill, which fills each NA with the next non-null value below it. Another option is 'ffill', forward fill, which is particularly useful for time-series data. More detail about this parameter and its multiple value options is linked here.
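As a small sketch (the toy rows are my own stand-in for the dairy table), note that recent pandas versions deprecate the fillna(method='bfill') form in favor of the equivalent .bfill() method:

```python
import pandas as pd

# A two-line entry: the first row carries only the 'Food' text, the rest is NA
dairy = pd.DataFrame({
    "Food": ["Fortified milk", "(6 qt.)"],
    "Measure": [None, "1 qt."],
    "Grams": [None, 976.0],
})

# Backward fill: each NA takes the next non-null value below it.
# Equivalent to the article's fillna(method='bfill'); the method= form
# is deprecated in recent pandas releases.
filled = dairy.bfill()
```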
In fact, when I first learned .fillna(), I enjoyed it very much because it allowed me to impute NAs without first subsetting the rows containing them. Still, it was not until I discovered the method argument that this function became one of my favorite functions in Pandas.
5. Handling booleans with cumsum()
With the NAs filled in, how about taking a different approach to concatenating the 'Food' values? Step 1, let's create some key variables indicating which rows are to be combined, i.e., two consecutive rows having the same values on the variables other than 'Food',
Two tricks here: (1) using shift(1) to select the previous row, which is then compared against the current row; (2) calling cumsum() on the resulting booleans (treated as integers, i.e., 1=True, 0=False) to build a running key that increments whenever two consecutive rows differ.
Since we compared all 3 columns in the dataset, there are 3 additional key variables (e.g., Measure_) created from the
cumsum(). Once we merge them back with the dairy table, we will see the following output, where all 3 key variables stay the same for the rows to be combined,
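The two tricks combine into something like the sketch below. For brevity I compare only two columns rather than the article's three, and the toy rows are my own assumptions:

```python
import pandas as pd

# Toy frame: the first two rows share Measure and Grams, so they should be combined
dairy = pd.DataFrame({
    "Food": ["Fortified milk", "(6 qt.)", "Milk, whole"],
    "Measure": ["1 qt.", "1 qt.", "1 qt."],
    "Grams": [976, 976, 664],
})

cols = ["Measure", "Grams"]
# True wherever a value differs from the one in the previous row (shift(1))
diff = dairy[cols].ne(dairy[cols].shift(1))
# cumsum() over the booleans yields a group id per column that only
# increments when consecutive rows differ; '_' marks the key variables
keys = diff.cumsum().add_suffix("_")

dairy = pd.concat([dairy, keys], axis=1)
```

Rows sharing the same key values on every compared column (here, the first two rows) are exactly the ones to merge in the next step.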
6. Introspecting the GroupBy object with list()
Having our key variables available, we can continue to concatenate the 'Food' values using another favorite trick: .groupby() + .apply(). The code below returns the same output as in item №3,
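A minimal version of the .groupby() + .apply() concatenation might look like this; the toy frame and the single Grams_ key (rather than all three keys) are my simplifications:

```python
import pandas as pd

# Toy frame carrying a key produced by the shift()/cumsum() step
dairy = pd.DataFrame({
    "Food": ["Fortified milk", "(6 qt.)", "Milk, whole"],
    "Grams_": [1, 1, 2],
})

# Within each group of identical keys, join the 'Food' strings with a space
combined = dairy.groupby("Grams_")["Food"].apply(" ".join).reset_index()
```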
Similar to the GROUP BY operation in SQL, .groupby() in Pandas chunks our data into groups defined by the key variables. But how does it work exactly? To introspect the DataFrameGroupBy object, we can leverage the lesser-used list() function,
From the partial printout below, we can see that each group is a tiny dataframe sliced from the original dataset, and when we call apply(), the same operation (e.g., the join in our case) iterates through the individual groups until it reaches the end,
Therefore, explicitly visualizing the GroupBy object is extremely helpful for ensuring that we are applying the operations against the correct sub-datasets.
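The introspection itself is one line: wrapping the GroupBy object in list() materializes it as (key, sub-DataFrame) pairs. A sketch, again on an assumed toy frame with a single key column:

```python
import pandas as pd

dairy = pd.DataFrame({
    "Food": ["Fortified milk", "(6 qt.)", "Milk, whole"],
    "Grams_": [1, 1, 2],
})

# Each element is a (key, sub-DataFrame) pair -- the "tiny dataframes"
# that apply() will later iterate through
groups = list(dairy.groupby("Grams_"))
for key, frame in groups:
    print(key)      # the group key
    print(frame)    # the slice of the original dataset for that key
```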