
Aggregation, Transform, Filter — How and When to use them?

In case you’re wondering, when I say “victim”, it’s because I was spoiled by the capabilities of Pandas until I met groupby, which gave me a hard time understanding the mechanisms under the hood. After devoting some time to digging into it, I understand it much better. Now, let’s get straight to the point.

I always believe in the power of a simple dataset for illustrating methods. So this article uses a very simple one: population and life expectancy in different countries/regions.

You may already be familiar with groupby aggregations that use basic calculations such as a sum or a mean. Aggregation, Transform, and Filter take groupby to another level.

First, let’s take a glance at how the groupby function works. In 2011, Hadley Wickham introduced the idea of “split-apply-combine” in his paper “The Split-Apply-Combine Strategy for Data Analysis”, which makes groupby easy to picture. Generally, a groupby operation consists of three steps: split, apply, and combine. The split step breaks up the dataframe into subset dataframes based on the specified keys. Then, the apply step applies functions to those subset dataframes. Last, the combine step concatenates those results into a single output.
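To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and then compares the result with the one-line groupby call. The dataframe is a made-up toy (the country names and numbers are hypothetical, not the dataset used in this article), and the manual version only illustrates the idea; it is not how Pandas implements groupby internally.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["East", "East", "West", "West"],
    "Population": [10, 30, 20, 40],
    "Life Expectancy": [70.0, 80.0, 60.0, 90.0],
})

# Split: one sub-dataframe per unique "Region" value.
pieces = {region: sub_df for region, sub_df in df.groupby("Region")}

# Apply: compute one result per sub-dataframe.
applied = {region: sub_df["Population"].sum() for region, sub_df in pieces.items()}

# Combine: stitch the per-group results back into a single Series.
manual = pd.Series(applied, name="Population")
manual.index.name = "Region"

# The built-in call runs the same three steps in one line.
built_in = df.groupby("Region")["Population"].sum()

print(manual)
print(built_in)
```

Both printed Series should hold one total per region, which is the point of the pattern: the keys decide the split, the function decides the apply, and Pandas handles the combine.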

Bearing the above process in mind, we can easily understand Aggregation. Basically, it applies one or more functions to one or many columns. First, it splits the full dataframe into sub-dataframes based on “Region”. Then it applies the mean function to “Life Expectancy” and the sum function to “Population”. Last, it combines the per-group results into one output indexed by “Region”, which “reset_index” turns back into a regular dataframe. You end up with a new dataframe whose length is the number of unique values of the groupby keys (“Region” in our case). Below is the visual process of Aggregation.
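The aggregation described above can be sketched as follows. The toy dataframe is hypothetical (invented names and numbers, not the article's dataset), and passing a dictionary to agg is just one of several ways to say which function goes with which column.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["East", "East", "West", "West"],
    "Population": [10, 30, 20, 40],
    "Life Expectancy": [70.0, 80.0, 60.0, 90.0],
})

result = (
    df.groupby("Region")                 # split on the "Region" key
      .agg({"Life Expectancy": "mean",   # apply mean to one column
            "Population": "sum"})        # apply sum to another column
      .reset_index()                     # move "Region" from the index to a column
)

print(result)
```

The result has one row per unique region, so its length equals df["Region"].nunique().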