Bank Institution Term Deposit Predictive Model | by Glory Odeyemi | Sep, 2020

Correlation shows the relationship between variables in the dataset.

Seaborn boxplot is one of the ways of checking a dataset for outliers.

# Using boxplot to identify outliers
for col in num_data:
    ax = sns.boxplot(x=num_data[col])
    save(f"{col}")
    plt.show()

The code above visualizes the numerical columns in the dataset and outliers detected were treated using the Interquartile Range (IQR) method. The code can be found in this GitHub repository.
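
For reference, the IQR rule flags a value as an outlier when it lies more than 1.5 × IQR below the first quartile or above the third quartile. The sketch below is a dependency-free illustration of that rule, not the code from the repository. The function names `iqr_bounds` and `cap_outliers` are hypothetical, and capping the flagged values (rather than dropping them) is an assumption, since the article does not say how they were treated.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule."""
    # method="inclusive" interpolates linearly between the data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(values, k=1.5):
    """Clip every value to the IQR fences (one possible treatment)."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]

# toy column with one obvious outlier
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower, upper = iqr_bounds(sample)
capped = cap_outliers(sample)
```

In this toy column the fences are -3.5 and 14.5, so only the value 100 is affected and is clipped to 14.5.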

During the EDA, I found that our target variable ‘y’ (has the client subscribed to a term deposit? binary: ‘yes’, ‘no’) is highly imbalanced, which can affect our prediction model. This will be taken care of shortly, and this article does justice to some techniques for dealing with class imbalance.

Data Preprocessing

When building a machine learning model, it is important to preprocess the data to have an efficient model.

# create list containing categorical columns
cat_cols = ['job', 'marital', 'education', 'default', 'housing',
            'loan', 'contact', 'month', 'day_of_week', 'poutcome']
# create list containing numerical columns
num_cols = ['duration', 'campaign', 'emp.var.rate', 'pdays', 'age',
            'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
            'previous']

The following preprocessing was done in this stage:

  • Encoding Categorical columns

Most machine learning algorithms expect numerical inputs, which is why we need to convert our categorical values to numerical ones. I used the pandas get_dummies method to one-hot encode the categorical columns, and type-casting to turn the target into 0/1.

# function to encode categorical columns
def encode(data):
    cat_var_enc = pd.get_dummies(data[cat_cols], drop_first=False)
    return cat_var_enc

# defining output variable for classification
dataset_new['subscribed'] = (dataset_new.y == 'yes').astype('int')
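
To make the effect of one-hot encoding concrete, here is a small pure-Python sketch of the idea. It is an illustration only: the helper `one_hot` and the toy rows are made up for the example. Each distinct category becomes its own 0/1 column, named `<column>_<category>`, which is also how `pd.get_dummies` names its output columns.

```python
def one_hot(rows, column):
    """One-hot encode `column` in a list of dict rows."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # keep every other field unchanged
        new_row = {k: v for k, v in row.items() if k != column}
        # add one 0/1 indicator per category
        for cat in categories:
            new_row[f"{column}_{cat}"] = int(row[column] == cat)
        encoded.append(new_row)
    return encoded

toy = [
    {"age": 30, "marital": "single"},
    {"age": 45, "marital": "married"},
    {"age": 52, "marital": "divorced"},
]
encoded = one_hot(toy, "marital")
```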
  • Rescaling Numerical columns

Another data preprocessing step is to rescale the numerical columns so that they are on a comparable scale. Sklearn's StandardScaler() was used here; it standardizes each column to zero mean and unit variance.

# import library for rescaling
from sklearn.preprocessing import StandardScaler

# function to rescale numerical columns
def rescale(data):
    # creating an instance of the scaler object
    scaler = StandardScaler()
    data[num_cols] = scaler.fit_transform(data[num_cols])
    return data
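
StandardScaler computes z = (x - mean) / std for each column, using the population standard deviation. The sketch below reproduces that arithmetic in plain Python for a single column; `fit_scaler` and `transform` are hypothetical helper names and the ages are made up. It also shows a common practice that differs from the function above: the mean and standard deviation are learned from the training values only and then reused for the test values.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Learn the mean and population standard deviation from training data."""
    return fmean(train_values), pstdev(train_values)

def transform(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - mean) / std for v in values]

train_age = [20, 30, 40, 50, 60]
test_age = [40, 70]

mean, std = fit_scaler(train_age)
train_scaled = transform(train_age, mean, std)
# the test column reuses the training statistics
test_scaled = transform(test_age, mean, std)
```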
  • Specifying Dependent and Independent Variables

To proceed in building our prediction model, we have to specify our dependent and independent variables.

Independent variables: the inputs to the process being analyzed (here, the client and campaign features).

Dependent variable: the output of the process (here, whether the client subscribed).

X = data.drop(columns=[ "subscribed", 'duration'])
y = data["subscribed"]

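The column ‘duration’ was dropped because it strongly determines the output target (e.g., if duration=0 then y=‘no’). It is also not known before a call is made, so keeping it would leak information about the outcome into the model.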

It is good practice to split the dataset into training and test sets when building a machine learning model, because the held-out test set lets us evaluate how the model performs on unseen data.

# import library for splitting dataset
from sklearn.model_selection import train_test_split
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

When we have a large number of variables, it is advisable to reduce them to a smaller set that retains most of the information. There are various techniques for doing this, such as PCA, t-SNE, and autoencoders. For this project, we will be considering PCA, which replaces the original variables with a smaller number of components.

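# import PCA
from sklearn.decomposition import PCA
# create an instance of pca
pca = PCA(n_components=20)
# fit pca to our training data only
pca.fit(X_train)
pca_train = pca.transform(X_train)
X_train_reduced = pd.DataFrame(pca_train)
# apply the same fitted transformation to the test set (used for prediction later)
X_test_pca = pca.transform(X_test)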

As earlier stated, we have a highly imbalanced class, and this can affect our prediction if not treated.

In this project, I made use of SMOTE (Synthetic Minority Oversampling Technique) for dealing with class imbalance.

# importing the necessary function
from imblearn.over_sampling import SMOTE
# creating an instance
sm = SMOTE(random_state=27)
# applying it to the training set
# (fit_resample replaces fit_sample, which was removed from newer imbalanced-learn releases)
X_train_smote, y_train_smote = sm.fit_resample(X_train_reduced, y_train)

Note: Apply SMOTE to the training data only, so that the test set keeps the original class distribution and contains no synthetic samples.
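
Conceptually, SMOTE creates each synthetic minority sample by picking a minority point, choosing one of its nearest minority-class neighbours, and interpolating somewhere on the line segment between the two. The following is a simplified sketch of that interpolation step only, not imbalanced-learn's implementation. The function `smote_like_sample`, the toy points, and the choice of k are assumptions made for illustration.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # nearest neighbours of `base` among the other minority points
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

# toy minority-class points in two dimensions
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
synthetic = smote_like_sample(minority_points, k=2, rng=random.Random(27))
```

Because the new point always lies between two existing minority points, it stays inside the region the minority class already occupies.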

Machine Learning Model

Whew! We finally made it to building the model; data preprocessing can be quite a handful when building a machine learning model. Let’s not waste any time and dive right in.

The machine learning algorithms considered in this project are:

  • Logistic Regression
  • XGBoost
  • Multi Layer Perceptron

The cross validation methods used (cross validation is essential, especially in our case where the classes are imbalanced) are:

  • K-Fold: splits the data set into K sections (folds), and each fold is used as the test set exactly once.
  • Stratified K-Fold: This is a variation of K-Fold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
# import machine learning model libraries
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
# import libraries for cross validation
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate

metrics = ['accuracy', 'roc_auc', 'f1', 'precision', 'recall']

# function to build machine learning models
def model(model, cv_method, metrics, X_train, X_test, y_train):
    if model == 'LR':
        # creating an instance of the regression
        model_inst = LogisticRegression()
        print('Logistic Regression\n----------------------')
    elif model == 'XGB':
        # creating an instance of the classifier
        model_inst = XGBClassifier()
        print('XGBoost\n----------------------')
    elif model == 'MLP':
        # creating an instance of the classifier
        model_inst = MLPClassifier()
        print('Multi Layer Perceptron\n----------------------')

    # cross validation
    # (random_state is omitted: it only has an effect when shuffle=True,
    # and recent scikit-learn versions raise an error otherwise)
    if cv_method == 'KFold':
        print('Cross validation: KFold\n--------------------------')
        cv = KFold(n_splits=10)
    elif cv_method == 'StratifiedKFold':
        print('Cross validation: StratifiedKFold\n-----------------')
        cv = StratifiedKFold(n_splits=10)
    else:
        print('Cross validation method not found!')
    try:
        cv_scores = cross_validate(model_inst, X_train, y_train,
                                   cv=cv, scoring=metrics)
        # displaying evaluation metric scores
        cv_metric = cv_scores.keys()
        for metric in cv_metric:
            mean_score = cv_scores[metric].mean()*100
            print(metric+':', '%.2f%%' % mean_score)
            print('')

    except ValueError:
        # roc_auc is undefined when a fold contains a single class,
        # so fall back to the remaining metrics
        metrics = ['accuracy', 'f1', 'precision', 'recall']
        cv_scores = cross_validate(model_inst, X_train, y_train,
                                   cv=cv, scoring=metrics)
        # displaying evaluation metric scores
        cv_metric = cv_scores.keys()
        for metric in cv_metric:
            mean_score = cv_scores[metric].mean()*100
            print(metric+':', '%.2f%%' % mean_score)
            print('')

    return model_inst
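
To illustrate the difference between the two splitters, the sketch below builds folds by hand for a toy, ordered label list. It is a simplified stand-in for unshuffled K-Fold and Stratified K-Fold, not scikit-learn's actual algorithm; the helper names and the 90/10 toy labels are assumptions.

```python
def plain_folds(labels, n_splits):
    """Contiguous folds over the indices, ignoring the labels
    (assumes len(labels) is divisible by n_splits)."""
    size = len(labels) // n_splits
    return [list(range(i * size, (i + 1) * size)) for i in range(n_splits)]

def stratified_folds(labels, n_splits):
    """Deal the indices of each class round-robin across the folds so
    every fold keeps roughly the overall class ratio."""
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        cls_indices = [i for i, y in enumerate(labels) if y == cls]
        for pos, idx in enumerate(cls_indices):
            folds[pos % n_splits].append(idx)
    return folds

# 90 negatives followed by 10 positives (ordered, imbalanced)
labels = [0] * 90 + [1] * 10

# number of positive samples that end up in each fold
plain_pos = [sum(labels[i] for i in fold) for fold in plain_folds(labels, 5)]
strat_pos = [sum(labels[i] for i in fold) for fold in stratified_folds(labels, 5)]
```

Here the plain folds put all ten positives into the last fold, so four of the five test folds contain a single class and AUC cannot be computed on them, while every stratified fold holds two positives.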

Evaluation Metrics

  • Accuracy: The proportion of data points that were predicted correctly. This can be a misleading metric for an imbalanced dataset, so it is advisable to consider other evaluation metrics as well.
  • AUC (Area under the ROC Curve): It provides an aggregate measure of performance across all possible classification thresholds.
  • Precision: The number of correctly predicted positive examples divided by the total number of examples that were predicted as positive.
  • Recall: The number of correctly predicted positive examples divided by the total number of examples that are actually positive.
  • F1 score: The harmonic mean of Precision and Recall.
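
These definitions can be written directly in terms of the confusion-matrix counts: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The sketch below is a plain-Python illustration with made-up counts, not the project's results; `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # guard against division by zero when nothing is predicted/actually positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts for an imbalanced problem (50 positives in 1000)
scores = classification_metrics(tp=30, fp=10, fn=20, tn=940)
```

With these counts accuracy is 0.97 while recall is only 0.6, which is why accuracy alone can mislead on imbalanced data.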

[Table: K-Fold Cross Validation Evaluation Metrics]
[Table: Stratified K-Fold Evaluation Metrics]

Comparing Results

  • K-Fold vs Stratified K-Fold

As can be seen from the tables above, Stratified K-Fold gave much better results than K-Fold cross validation. K-Fold cross validation failed to provide an AUC score for the Logistic Regression and XGBoost models; AUC is undefined whenever a test fold contains only one class, which unstratified folds do not guard against. Therefore, the Stratified K-Fold results are used for further comparison.

Based on these results, XGBoost proves to be a better prediction model than Logistic Regression and MLP because it has the highest scores in four of the five evaluation metrics.

Prediction

XGBoost, being the best performing model, is used for prediction.

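# fitting the model to the train data
# (xgb is assumed to be the XGBClassifier instance returned by model('XGB', ...))
model_xgb = xgb.fit(X_train_smote, y_train_smote)
# make predictions on the PCA-transformed test set
y_pred = model_xgb.predict(X_test_pca)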

Conclusion

The main objective of this project was to build a model that predicts which customers will subscribe to a bank term deposit. We achieved that by comparing three different models and using the best one for prediction. We also went through rigorous steps to prepare the data and chose several evaluation metrics to measure the performance of our models.

In the results, XGBoost was the best model, with the highest scores in four of the five evaluation metrics.

Further Study

In this project, I used only three machine learning algorithms. Other algorithms, such as SVM, Random Forest, and Decision Trees, can also be explored.

The full code for this project can be found in this GitHub repository.

I know this was a very long ride, but thank you for sticking with me to the end. I also appreciate 10 Academy once again, and my fellow learners for the wonderful opportunity to partake in this project.

Reference