A Practical Guide for Exploratory Data Analysis — Churn Dataset | by Soner Yıldırım | Sep, 2020

Let’s check how “Gender” and “Geography” are related to customer churn. One way is to use the groupby function of pandas.

df[['Geography','Gender','Exited']].groupby(['Geography','Gender']).agg(['mean','count'])

Finding: In general, females are more likely to “exit” than males. The exit (churn) rate in Germany is higher than in France and Spain.
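
To make the groupby step concrete, here is a minimal sketch on a tiny, made-up dataframe (the `toy` rows are hypothetical and not taken from the churn dataset). It relies on the fact that “Exited” is coded as 0/1, so the mean of each group can be read as that group's churn rate.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the article.
toy = pd.DataFrame({
    "Geography": ["France", "France", "Germany", "Germany", "Spain", "Spain"],
    "Gender":    ["Female", "Male", "Female", "Female", "Male", "Male"],
    "Exited":    [1, 0, 1, 0, 0, 0],
})

# Exited is 0/1, so the group mean is the share of customers who left.
# Selecting the column first keeps the result columns flat: "mean" and "count".
summary = toy.groupby(["Geography", "Gender"])["Exited"].agg(["mean", "count"])
print(summary)
```

The count column is worth reading together with the mean: a churn rate computed on a handful of rows is far less reliable than one computed on thousands.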

Another common practice in the EDA process is to check the distribution of variables. Distribution plots, histograms, and boxplots give us an idea about the distribution of variables (i.e. features).

fig, axs = plt.subplots(ncols=2, figsize=(12,6))
fig.suptitle("Distribution of Balance and Estimated Salary", fontsize=15)
# kdeplot draws the same density curve as distplot(hist=False), which is deprecated since seaborn 0.11
sns.kdeplot(df.Balance, ax=axs[0])
sns.kdeplot(df.EstimatedSalary, ax=axs[1])

A large share of the customers have zero balance, which shows up as a spike at zero. For the remaining customers, “Balance” is approximately normally distributed. “EstimatedSalary” appears to be uniformly distributed.

Since there are lots of customers with zero balance, we can create a new binary feature indicating whether a customer has a non-zero balance (0 for zero balance, 1 otherwise). The where function of pandas will do the job.

df['Balance_binary'] = df['Balance'].where(df['Balance'] == 0, 1)
df['Balance_binary'].value_counts()
1.0 6383
0.0 3617
Name: Balance_binary, dtype: int64
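
The direction of where is easy to get backwards, so here is a small sketch with made-up balances (the `balance` values are hypothetical). It shows that where keeps the values for which the condition is True and writes the replacement everywhere else. It also shows an alternative spelling with a plain comparison, which is my own suggestion rather than the approach used above.

```python
import pandas as pd

balance = pd.Series([0.0, 1500.5, 0.0, 82000.0])  # hypothetical balances

# where(cond, other): keep the value where cond is True, use `other` elsewhere.
# Zeros satisfy the condition and stay 0; every non-zero balance becomes 1.
via_where = balance.where(balance == 0, 1)

# Alternative spelling: compare directly and cast the booleans to integers.
via_compare = (balance != 0).astype(int)

print(via_where.tolist())
print(via_compare.tolist())
```

Both versions flag the same rows. The comparison returns integers, while where keeps the float type of the original column, which is why the value counts above are labelled 1.0 and 0.0.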

Roughly 36 percent of the customers (3,617 of 10,000) have zero balance. Let’s see the effect of having zero balance on churning.

df[['Balance_binary','Exited']].groupby('Balance_binary').mean()

Finding: Customers with zero balance are less likely to churn.

Another important statistic to check is the correlation among variables.

Correlation is a normalization of covariance by the standard deviation of each variable. Covariance is a quantitative measure that represents how much the variations of two variables match each other. To be more specific, covariance compares two variables in terms of the deviations from their mean (or expected) value.

By checking the correlation, we are trying to find how similarly two random variables deviate from their mean.
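
The relationship between covariance and correlation can be sketched in a few lines of plain Python. The numbers below are made up for illustration, and the helper names (`mean`, `covariance`, `correlation`) are my own. The sketch uses the sample (n - 1) version of covariance; the choice of denominator cancels out in the correlation.

```python
import math

# Hypothetical toy values, not taken from the churn dataset.
age = [25.0, 35.0, 45.0, 55.0]
exited = [0.0, 0.0, 1.0, 1.0]

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    # Sum of products of deviations from the two means, divided by n - 1.
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    # Pearson correlation: covariance divided by both standard deviations.
    # covariance(x, x) is the variance of x, so its square root is the std.
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

r = correlation(age, exited)
print(round(r, 4))
```

Dividing by the standard deviations removes the units, so the result always falls between -1 and 1 no matter how large the raw values are.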

The corr function of pandas returns a correlation matrix indicating the correlations between numerical variables. We can then plot this matrix as a heatmap.

The correlation matrix only covers numerical columns, so it helps to convert the values in the “Gender” column to numbers first. This can be done with the replace function of pandas.

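# Assign the result back instead of using inplace=True on a selected column
df['Gender'] = df['Gender'].replace({'Male':0, 'Female':1})
# numeric_only=True (pandas 1.5 or later) skips text columns such as Geography
corr = df.corr(numeric_only=True)
plt.figure(figsize=(12,8))
sns.heatmap(corr, cmap='Blues_r', annot=True)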
The correlation matrix

Finding: The “Age”, “Balance”, and “Gender” columns are positively correlated with customer churn (“Exited”). There is a negative correlation between being an active member (“IsActiveMember”) and customer churn.
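
When the matrix has many columns, it can be easier to read a single column of it as a sorted list than to scan the heatmap. Here is a minimal sketch on a made-up dataframe (the `toy` values are hypothetical, and only the column names follow the article).

```python
import pandas as pd

# Hypothetical toy data with a few of the article's column names.
toy = pd.DataFrame({
    "Age":            [25, 32, 41, 48, 55, 60],
    "IsActiveMember": [1, 1, 1, 0, 0, 0],
    "Exited":         [0, 0, 0, 1, 1, 1],
})

corr = toy.corr()

# Take the target's column, drop its correlation with itself,
# and sort so the strongest positive relationship comes first.
with_target = corr["Exited"].drop("Exited").sort_values(ascending=False)
print(with_target)
```

Sorting by the signed value puts positive correlations at the top and negative ones at the bottom; sorting by the absolute value instead would rank features by strength regardless of direction.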

If you compare “Balance” and “Balance_binary”, you will notice a very strong positive correlation since we created one based on the other.

Since “Age” turns out to have the highest correlation with “Exited”, let’s dig a little deeper.

df[['Exited','Age']].groupby('Exited').mean()

The average age of churned customers is higher. We should also check the distribution of the “Age” column.

plt.figure(figsize=(6,6))
plt.title("Boxplot of the Age Column", fontsize=15)
sns.boxplot(y=df['Age'])

The dots above the upper whisker indicate outliers, and there are many of them on the upper side. Another way to check for outliers is to compare the mean and the median.
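
The rule behind those dots can be reproduced by hand. With seaborn's default setting (whis=1.5), a point is drawn as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to a made-up series of ages (the `ages` values are hypothetical).

```python
import pandas as pd

ages = pd.Series([22, 30, 33, 35, 37, 38, 40, 44, 48, 85])  # hypothetical ages

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1  # interquartile range: the height of the box

# Values outside these fences are the ones plotted as individual dots.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

outliers = ages[(ages > upper_fence) | (ages < lower_fence)]
print(upper_fence, lower_fence)
print(outliers.tolist())
```

Counting the flagged rows gives a number to go with the picture, which is handy when deciding how to treat them.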

print(df['Age'].mean())
38.9218
print(df['Age'].median())
37.0

The mean is higher than the median, which is consistent with the boxplot: the large values on the upper side pull the mean up. There are many different ways to handle outliers, and the topic could fill an entire post.

Let’s do a simple one here. We will remove the data points that are in the top 5 percent.

q95 = np.quantile(df['Age'], 0.95)
df = df[df['Age'] < q95]
df.shape
(9474, 14)

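The first line finds the 95th percentile of “Age”, the value that separates the top 5 percent. In the second line, we use this value to filter the dataframe. The original dataframe has 10,000 rows, so we removed 526 rows. That is slightly more than 5 percent because the strict comparison also drops every customer whose age is exactly equal to the threshold.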

Please note that dropping rows like this is not acceptable in many cases. Data is a valuable asset, and more data generally lets us build better models. Here we are only trying to see whether outliers have an effect on the relationship between age and customer churn.
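
One way to run this kind of check without losing anything is to store the filtered rows in a new variable and leave the full dataframe untouched. The sketch below does that on made-up data (the `toy` values and the names `cutoff`, `trimmed`, `full_means`, and `trimmed_means` are hypothetical) and compares the average age per group before and after trimming.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names follow the article.
toy = pd.DataFrame({
    "Age":    [25, 30, 35, 40, 45, 50, 55, 60, 65, 92],
    "Exited": [0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
})

cutoff = np.quantile(toy["Age"], 0.95)

# Filter into a new variable so the full dataframe stays available.
trimmed = toy[toy["Age"] < cutoff]

full_means = toy.groupby("Exited")["Age"].mean()
trimmed_means = trimmed.groupby("Exited")["Age"].mean()

# Gap between the average age of churned (1) and retained (0) customers.
full_gap = full_means[1] - full_means[0]
trimmed_gap = trimmed_means[1] - trimmed_means[0]
print(full_gap, trimmed_gap)
```

In this toy example the gap shrinks after trimming but stays positive, which is the kind of comparison the next steps make on the real data.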

Let’s compare the new mean and median.

print(df['Age'].mean())
37.383681655055945
print(df['Age'].median())
37.0

The mean and the median are now pretty close. It is time to compare the average age of churned customers with that of customers who did not churn.

df[['Exited','Age']].groupby('Exited').mean()

Our finding still holds true. The average age of churned customers is higher.