- 1 EXPLORING DATA WITH PYTHON
- 2 Making your data even more delicious with joins and Pandas
- 3 Setting the stage
- 4 Step 1: taking a quick look at the data we’d like to join
- 5 Step 2: avoid blind dates for your data
- 6 Step 3: alright, let’s give the blind date a try…
- 7 Step 4: so what’s wrong with the blind date?
- 8 Step 5: cleaning up after a blind join
- 9 Step 6: a little challenge for the data addicts out there
- 10 Wrapping it up
EXPLORING DATA WITH PYTHON
Making your data even more delicious with joins and Pandas
Between data blends, joins, and wrestling with the resulting levels of detail in Tableau, managing relationships between data can be tricky.
Stepping into your shiny new Python shoes, the world opens up a bit. Instead of trying to squeeze everything into a single data source to rule them all, you can choose your battles.
In our previous articles, we already saw:
- Pandas groupby expressions and basic visualizations
- Calculations and coloring sales by profitability
- Data exploration with advanced visuals
Setting the stage
In this article, we’ll focus on one of the most important aspects of working with data in any ecosystem: joins.
Lucky for us, the Tableau data we have been playing around with comes with batteries included! Within the context of the fictitious store we have been analyzing, we have a bit of data detailing various items that were returned.
Let’s use this data on orders and returns to get comfortable joining data in Python using the Pandas library. Along the way, we’ll see how different types of joins can help us accomplish different goals.
Step 1: taking a quick look at the data we’d like to join
Remember that orders data we’ve been working with? Let’s get a good look at it real quick.
Ah, that’s familiar. Now let’s see what’s behind door #2. But first, in the spirit of getting a feel for how Pandas works, let’s highlight how we are getting our hands on this data.
In the first article, we used Pandas read_excel() function to get our orders data from an Excel file. It turns out this file has multiple sheets:
Let’s ignore the ‘People’ sheet for now (nobody likes them anyway), and simply acknowledge the fact that when we used the read_excel() function without specifying a sheet name, the function grabbed the first sheet available in the Excel workbook.
Here’s how we can specify that we want to fetch data from the ‘Returns’ sheet:
returns_df = pd.read_excel('Global Superstore.xls', sheet_name='Returns')
And here’s what that data looks like:
Step 2: avoid blind dates for your data
In the data joining world, an arranged marriage is better than a blind date. That is, we want to be the overbearing parents who inspect the columns we are about to join our data on, looking for any inconsistencies that would make for a poor match. Blindly joining our data together just because we assume the columns match up could lead to some undesired results.
To prove that point, let’s take a stroll into a blind date and see what happens.
For starters, let’s observe which columns our two data sets have in common.
It looks to me like we have ‘Order ID’ and ‘Market’ in common. Of course, in many real world scenarios the same information might be stored under different names, but this isn’t the real world. It’s an example!
By the way, here’s a shortcut you could use to find values that exist in both column lists:
common_cols = [col for col in store_df.columns if col in returns_df.columns]
What you just saw is called a list comprehension. If you end up falling in love with Python, these will become like second nature to you. They’re very useful. In this case, our list comprehension provides us any value of list X whose values also appear in list Y.
Now that we know we have two columns in common, what are we waiting for!? Initiate the blind date!
Step 3: alright, let’s give the blind date a try…
So we have ‘Order ID’ and ‘Market’ in common between our orders and our returns. Let’s join them.
In Pandas, you can use the join() method and you can also use the merge() method. Different strokes for different folks. I prefer the merge() method in pretty much any scenario, so that’s what we’ll use here. Team merge, for life.
blind_date_df = store_df.merge(returns_df, on=['Order ID', 'Market'], how='left')
And let’s see what the results look like:
First of all, we ran .shape on the resulting dataframe to get values for the number of rows and columns (in that order). So our resulting dataframe has 51,290 rows and 25 columns, where the original orders dataframe has 51,290 rows and 24 columns.
This join has effectively sprinkled in new data for each of our rows, providing one additional column named ‘Returned’, which takes on the value of ‘Yes’ if an order was returned.
Note that in our join we specified the columns to join on as well as how to perform the join. What is this ‘left’ join? It simply means that the table that was there first (in this example that is our store_df) will remain as-is, and the new table’s data will be sprinkled onto it wherever relevant.
Let’s compare this to an inner join:
This inner join behaves differently from our left join in the sense that it only keeps the intersection between the two tables. This type of join would be useful if we only cared about analyzing orders that were returned, as it filters out any orders that were not returned.
Step 4: so what’s wrong with the blind date?
Sometimes you think you know everything, and that’s when it bites you the hardest. In this example, we think we know that we have two matching columns: ‘Order ID’ and ‘Market’. But do our two data sets agree on what a market is?
Let’s stir up some drama. Orders, how do you define your markets?
This line of code takes a look at the entire ‘Market’ column and outputs the unique values found within it.
Okay. Returns, how do you define your markets?
It looks like our orders and returns teams both need to get on the same page in terms of whether they use acronyms for markets or spell them out.
On the orders side, to avoid future issues we should probably switch ‘Canada’ to an acronym value like ‘CA’.
On the returns side, to avoid future issues we should probably switch ‘United States’ to ‘US’
But wait, that only fixes the future issues… what kinds of problems is this causing right now?
Step 5: cleaning up after a blind join
To see the mess we’re in, let’s look at how many returns we have per market (using the inner join from earlier):
Hurray, it looks like our US market is perfect and has no returns! Or wait, is it the United States market… ah, oops.
Because the data containing our orders calls the market ‘US’ and the data containing our returns calls the market ‘United States’, the join will never match the two.
Luckily, it’s really easy to rename our markets. Here’s a quick way to do it in this situation, where there’s really just one mismatch that’s causing a problem. This introduces the concept of a lambda function, which you can simply ignore for now if it makes no sense to you.
returns_df['Market'] = returns_df['Market'].apply(lambda market: market.replace('United States', 'US'))
Basically what this does is it creates a function on the fly that we use quickly to perform a useful action in a single line of code. The result of running the line of code above is that any occurrence of the ‘United States’ market has been renamed to ‘US’.
Now, if we run that inner join between store_df and returns_df, the results will look a bit different:
And if we check how many returns there are per market, we get this:
Step 6: a little challenge for the data addicts out there
Now that we know how to join our orders and our returns, can you figure out how to stitch together a table like the one shown below?
Looks like the ‘Tables’ sub-category is crying out for attention again! We can’t escape it.
Applying what we’ve learned so far in this series, can you recreate this table on your own? If you’re a real over-achiever, go ahead and build it in Tableau as well and compare the process. How do you handle the market mismatch in Tableau vs Python? There are multiple ways to crack the case — go try it out!
Wrapping it up
Joining data is an absolutely crucial skill if you’re working with data at scale. Understanding what you’re joining is as important as knowing the technical details of how to execute the joins, keep that in mind!
If you send your data on a blind date to be joined with another table, be aware of the risks. Scrub your data sets clean before sending them on dates with other data. Dirty data tends to leave a mess, and you’ll be the one troubleshooting it.
Hope to see you next time as we dive into crafting reusable code using functions!