Correlation shows the relationship between variables in the dataset.

A seaborn boxplot is one way to check a dataset for outliers.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# using boxplots to identify outliers in each numerical column
for col in num_data:
    ax = sns.boxplot(x=num_data[col])
    plt.savefig(f"{col}.png")  # save the figure for this column
    plt.show()
```

The code above visualizes the numerical columns in the dataset; the outliers detected were treated using the Interquartile Range (IQR) method. The code can be found in this GitHub repository.
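As a sketch of the IQR treatment (the repository's exact implementation may differ), outliers can be capped at the Tukey fences:

```python
import pandas as pd

# toy series with one obvious outlier
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 100])

# compute the IQR fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip (cap) values that fall outside the fences
treated = s.clip(lower, upper)
print(treated.max() <= upper)  # → True
```

Capping keeps the row while pulling the extreme value back inside the fences; dropping the row entirely is the other common option.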

In the course of the EDA, I found out that our target variable 'y' (has the client subscribed to a term deposit? binary: 'yes'/'no') is highly imbalanced, which can affect our prediction model. This will be taken care of shortly, and this article does justice to some techniques for dealing with class imbalance.


## Data Preprocessing

When building a machine learning model, it is important to preprocess the data to have an efficient model.

```python
# create a list containing the categorical columns
cat_cols = ['job', 'marital', 'education', 'default', 'housing',
            'loan', 'contact', 'month', 'day_of_week', 'poutcome']

# create a list containing the numerical columns
num_cols = ['duration', 'campaign', 'emp.var.rate', 'pdays', 'age',
            'cons.price.idx', 'cons.conf.idx', 'euribor3m',
            'nr.employed', 'previous']
```

The following preprocessing was done in this stage:

**Encoding Categorical columns**

Machine learning algorithms only work with numerical values, which is why we need to convert our categorical values to numerical ones. I made use of the pandas get_dummies method and type-casting to one-hot encode the columns.

```python
# function to encode categorical columns
def encode(data):
    cat_var_enc = pd.get_dummies(data[cat_cols], drop_first=False)
    return cat_var_enc

# defining the output variable for classification
dataset_new['subscribed'] = (dataset_new.y == 'yes').astype('int')
```
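As a quick illustration of what get_dummies produces, here is a toy example (the column and values are for illustration, mirroring the dataset's `marital` column):

```python
import pandas as pd

# toy frame standing in for one categorical column of the bank dataset
df = pd.DataFrame({'marital': ['married', 'single', 'divorced', 'single']})

# one-hot encode; drop_first=False keeps one indicator column per category
encoded = pd.get_dummies(df, drop_first=False)
print(encoded.columns.tolist())
# → ['marital_divorced', 'marital_married', 'marital_single']
```

Each category becomes its own 0/1 indicator column, which is exactly the numerical representation the models need.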

**Rescaling Numerical columns**

Another data preprocessing step is to rescale our numerical columns; this puts our features on a comparable scale. Scikit-learn's StandardScaler(), which standardizes each column to zero mean and unit variance, was used here.

```python
# import library for rescaling
from sklearn.preprocessing import StandardScaler

# function to rescale numerical columns
def rescale(data):
    # creating an instance of the scaler object
    scaler = StandardScaler()
    data[num_cols] = scaler.fit_transform(data[num_cols])
    return data
```

**Specifying Dependent and Independent Variables**

To proceed in building our prediction model, we have to specify our dependent and independent variables.

Independent variables are the inputs to the process being analyzed.

The dependent variable is the output of the process.

```python
X = data.drop(columns=["subscribed", "duration"])
y = data["subscribed"]
```

The column ‘duration’ was dropped because it highly affects the output target (e.g., if duration=0 then y=’no’).

It is good practice to split the dataset into a train and a test set when building a machine learning model, because it lets us evaluate the model's performance on unseen data.

```python
# import library for splitting the dataset
from sklearn.model_selection import train_test_split

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
```

When we have a large number of variables, it is advisable to consider reducing them by keeping only the most important ones. There are various techniques for doing this, such as PCA, t-SNE, and autoencoders. For this project, we will be considering PCA.

```python
# import PCA
from sklearn.decomposition import PCA

# create an instance of PCA
pca = PCA(n_components=20)

# fit PCA to our training data and transform it
pca.fit(X_train)
pca_train = pca.transform(X_train)
X_train_reduced = pd.DataFrame(pca_train)
```
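One point worth emphasizing: the same fitted PCA must be reused to transform the test set, never refit on it. A minimal sketch with synthetic data (shapes chosen for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 30))
X_test = rng.normal(size=(20, 30))

pca = PCA(n_components=20)
pca.fit(X_train)                        # fit on the training split only
X_train_reduced = pca.transform(X_train)
X_test_reduced = pca.transform(X_test)  # reuse the fitted components on the test split

print(X_train_reduced.shape, X_test_reduced.shape)  # → (100, 20) (20, 20)
```

Refitting PCA on the test data would project train and test into different component spaces and leak test-set statistics into the pipeline.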

As earlier stated, we have a highly imbalanced class, and this can affect our prediction if not treated.

In this project, I made use of SMOTE (Synthetic Minority Oversampling Technique) for dealing with class imbalance.

```python
# importing the necessary function
from imblearn.over_sampling import SMOTE

# creating an instance
sm = SMOTE(random_state=27)

# applying it to the training set (fit_resample in current imblearn;
# fit_sample was removed in newer versions)
X_train_smote, y_train_smote = sm.fit_resample(X_train_reduced, y_train)
```

**Note:** It is advisable to apply SMOTE only to the training data, so that no synthetic samples leak into the test set.

## Machine Learning Model

Whew! We finally made it to building the model; data preprocessing can be quite a handful when building a machine learning model. Let's not waste any time and dive right in.

The machine learning algorithms considered in this project include:

- Logistic Regression
- XGBoost
- Multi Layer Perceptron

and the cross-validation methods used (essential especially in our case, where we have an imbalanced class) include:

- **K-Fold:** splits a given dataset into K sections/folds, where each fold is used as a test set at some point.
- **Stratified K-Fold:** a variation of K-Fold that returns stratified folds; the folds are made by preserving the percentage of samples for each class.
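The difference matters for an imbalanced target. A quick sketch (with an illustrative 9:1 class ratio) shows that every Stratified K-Fold test fold preserves the class proportions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 9:1 imbalance, mimicking the 'subscribed' target
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)

# count class occurrences in each test fold
fold_counts = [np.bincount(y[test_idx]) for _, test_idx in skf.split(X, y)]
print(fold_counts[0])  # → [9 1]
```

With plain K-Fold, by contrast, a fold can end up with very few (or zero) minority samples, which is why some metrics can fail to compute.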

```python
# import machine learning model libraries
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier

# import libraries for cross validation
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate

metrics = ['accuracy', 'roc_auc', 'f1', 'precision', 'recall']

# function to build machine learning models
def model(model, cv_method, metrics, X_train, X_test, y_train):
    if model == 'LR':
        # creating an instance of the regression
        model_inst = LogisticRegression()
        print('Logistic Regression\n----------------------')
    elif model == 'XGB':
        # creating an instance of the classifier
        model_inst = XGBClassifier()
        print('XGBoost\n----------------------')
    elif model == 'MLP':
        # creating an instance of the classifier
        model_inst = MLPClassifier()
        print('Multi Layer Perceptron\n----------------------')

    # cross validation
    if cv_method == 'KFold':
        print('Cross validation: KFold\n--------------------------')
        cv = KFold(n_splits=10, shuffle=True, random_state=100)
    elif cv_method == 'StratifiedKFold':
        print('Cross validation: StratifiedKFold\n-----------------')
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
    else:
        print('Cross validation method not found!')
        return None

    try:
        cv_scores = cross_validate(model_inst, X_train, y_train,
                                   cv=cv, scoring=metrics)
    except Exception:
        # roc_auc can fail to compute for some model/fold combinations,
        # so retry without it
        metrics = ['accuracy', 'f1', 'precision', 'recall']
        cv_scores = cross_validate(model_inst, X_train, y_train,
                                   cv=cv, scoring=metrics)

    # displaying evaluation metric scores
    for metric in cv_scores.keys():
        mean_score = cv_scores[metric].mean() * 100
        print(metric + ':', '%.2f%%' % mean_score)
    print('')
    return model_inst
```

**Evaluation Metrics**

- **Accuracy:** the proportion of correctly predicted data points. This can be a misleading metric for an imbalanced dataset, so it is advisable to consider other evaluation metrics as well.
- **AUC (Area Under the ROC Curve):** provides an aggregate measure of performance across all possible classification thresholds.
- **Precision:** the ratio of correctly predicted positive examples to the total number of examples predicted positive.
- **Recall:** the percentage of total relevant results correctly classified by the algorithm.
- **F1 score:** the weighted (harmonic) average of precision and recall.
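All of these metrics can be computed directly with scikit-learn; a small illustrative example (the labels and scores below are made up, not taken from the project):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 0, 1, 0]              # ground-truth labels
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]              # hard predictions
y_score = [0.1, 0.2, 0.6, 0.9, 0.4, 0.3, 0.8, 0.2]  # predicted probabilities

print('accuracy :', accuracy_score(y_true, y_pred))   # → 0.75
print('precision:', precision_score(y_true, y_pred))
print('recall   :', recall_score(y_true, y_pred))
print('f1       :', f1_score(y_true, y_pred))
print('auc      :', roc_auc_score(y_true, y_score))   # AUC needs scores, not labels
```

Note that AUC is computed from the predicted probabilities rather than the hard 0/1 predictions, which is also why `cross_validate` needs a model exposing probability estimates for the `roc_auc` scorer.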

**Comparing Results**

**K-Fold vs Stratified K-Fold**

As can be seen from the table above, Stratified K-Fold produced much better results than K-Fold cross validation. K-Fold cross validation failed to provide the AUC score for the Logistic Regression and XGBoost models. Therefore, the Stratified K-Fold results are used for further comparison.

From the results obtained, XGBoost proves to be a better prediction model than Logistic Regression and MLP, as it has the highest scores in 4/5 of the evaluation metrics.

## Prediction

XGBoost, being the best-performing model, is used for prediction.

```python
# fitting the model to the resampled training data
model_xgb = xgb.fit(X_train_smote, y_train_smote)

# make predictions on the PCA-transformed test set
y_pred = model_xgb.predict(X_test_pca)
```
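To inspect how such predictions perform against the held-out labels, a confusion matrix and classification report are handy. A sketch with illustrative stand-in labels (not the project's actual `y_test` and `y_pred`):

```python
from sklearn.metrics import confusion_matrix, classification_report

# illustrative labels standing in for y_test and y_pred
y_test = [0, 1, 0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 1, 0, 0, 1]

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
# → [[3 1]
#    [1 3]]

# per-class precision, recall, and F1 in one table
print(classification_report(y_test, y_pred))
```

The off-diagonal cells (false positives and false negatives) are exactly what the precision and recall scores above summarize.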

## Conclusion

The main objective of this project is to build a model that predicts customers that would subscribe to a bank term deposit, and we were able to achieve that by considering three different models and using the best one for the prediction. We also went through rigorous steps of preparing our data for the model and choosing various evaluation metrics to measure the performance of our models.

From the results obtained, we observe that XGBoost was the best model, with the highest scores in 4/5 of the evaluation metrics.

**Further Study**

In this project, I used only three machine learning algorithms. However, algorithms such as SVM, Random Forest, and Decision Trees can also be explored.

A detailed code for this project can be found in this GitHub repository.

I know this was a very long ride, but thank you for sticking with me to the end. I also appreciate 10 Academy once again, and my fellow learners for the wonderful opportunity to partake in this project.