In case you’re wondering, when I say “victim”, it’s because I was too spoiled by the capabilities of Pandas until I met Aggregation, Transform, and Filter, which gave me a hard time understanding the mechanisms under the hood. After devoting some time to digging into them, I have a much better understanding of how they work. Now, let’s get straight to the point.
I believe in the power of a simple dataset for illustrating methods, so this article uses a very simple one: population and life expectancy in different countries/regions.
You may already be familiar with groupby aggregations that use basic calculations like mean(), sum(), and median(). Aggregation, Transform, and Filter take groupby to another level.
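As a quick refresher, a basic groupby aggregation might look like the sketch below. The toy dataframe is hypothetical: its values are invented, and the column names “Region”, “Life Expectancy”, and “Population” are assumed from the dataset description rather than taken from the real data.

```python
import pandas as pd

# Hypothetical toy data: values are invented for illustration only.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["Asia", "Asia", "Europe", "Europe"],
    "Life Expectancy": [70.0, 80.0, 82.0, 78.0],
    "Population": [100, 200, 50, 150],
})

# One basic calculation per call, computed separately for each region.
mean_life = df.groupby("Region")["Life Expectancy"].mean()
total_pop = df.groupby("Region")["Population"].sum()
median_life = df.groupby("Region")["Life Expectancy"].median()

print(mean_life)
print(total_pop)
print(median_life)
```

Each result is a series indexed by region, with one row per unique value of the groupby key.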
First, let’s take a glance at how the groupby function works. In 2011, Hadley Wickham introduced the idea of “split-apply-combine” in his paper “The Split-Apply-Combine Strategy for Data Analysis”, which makes the groupby function easy to picture. Generally, a groupby operation consists of three steps: split, apply, and combine. The split step breaks up the dataframe into subset dataframes based on the specified keys. Then, the apply step applies functions to those subset dataframes. Last, the combine step concatenates those results into a single output.
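To make the three steps concrete, here is a minimal sketch that performs them by hand and compares the outcome with the built-in groupby. It only illustrates the idea and is not how Pandas implements groupby internally. The toy data and the variable names (pieces, applied, manual, builtin) are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data: values are invented for illustration only.
df = pd.DataFrame({
    "Region": ["Asia", "Asia", "Europe", "Europe"],
    "Life Expectancy": [70.0, 80.0, 82.0, 78.0],
})

# Split: build one sub-dataframe per unique key value.
pieces = {key: df[df["Region"] == key] for key in df["Region"].unique()}

# Apply: run a function (here, the mean) on each sub-dataframe.
applied = {key: sub["Life Expectancy"].mean() for key, sub in pieces.items()}

# Combine: stitch the per-group results back into one output.
manual = pd.Series(applied, name="Life Expectancy").rename_axis("Region")

# The built-in groupby does all three steps in one call.
builtin = df.groupby("Region")["Life Expectancy"].mean()

print(manual)
print(builtin)
```

Both printed series should contain the same per-region means; the hand-written version just exposes the intermediate pieces that groupby normally hides.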
Bearing the above process in mind, we can easily understand how
aggregation applies multiple functions to one column or to many columns. First, it splits the full dataframe into sub-dataframes based on “Region”. Then it applies the mean function to “Life Expectancy” and the sum function to “Population”. Last, it combines the per-group results into one object indexed by “Region”, which can be converted to a flat dataframe through “reset_index”. You end up with a different dataframe whose length is the number of unique values of the groupby keys (“Region” in our case). Below is the visual process of