A Quick Little Lesson on KNN

For Beginners, by a Beginner

Forest Franzose

As the title says, here is a quick little lesson on how to construct a simple KNN (k-nearest neighbors) model in scikit-learn. I will be using the xAPI-Edu-Data dataset, which contains information on students’ academic performance.

Features included are things like how many times a student raises their hand, their gender, parent satisfaction, how often they were absent from class, and how often they participated in class discussion.

Each student is grouped into one of three academic classes: High (H), Medium (M), and Low (L). I used the other features to predict which class each student falls into.

Just for reference:

  • High, 90–100
  • Medium, 70–89
  • Low, 0–69

Okay, cool! Let’s get started.

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from statsmodels.formula.api import ols

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
import matplotlib.pyplot as plt
%matplotlib inline

First, you want to import all of the libraries that you’re going to need. Some people import each library at each stage of the process, but personally I like to do it all at the beginning.

Technically we won’t really be using Seaborn or Matplotlib, but I like to keep them around just in case I want to visualize something during the process.

df = pd.read_csv('xAPI-Edu-Data.csv')
df.head()
Screenshot of partial output.

Cool! The data is in good shape to begin with. There are no missing values and no outliers to speak of. However, we will have to do a small amount of preprocessing to get it ready for our model.
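If you ever want to verify that kind of claim for yourself, a quick look at the summary statistics usually does the trick. This is just a sanity check, not part of the pipeline:

# df.info() shows dtypes and non-null counts; df.describe() shows
# min/max and quartiles, so extreme outliers stand out quickly.
df.info()
df.describe()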

Preprocessing

# Dropping all unnecessary columns

df = df.drop(['NationalITy', 'PlaceofBirth', 'StageID', 'GradeID',
              'SectionID', 'Topic', 'Relation', 'ParentAnsweringSurvey'],
             axis=1, inplace=False)
df.head()

Screenshot of output.

When feeding a KNN model, you only want to include the features that you actually want the model to use when making its decision. This may seem obvious, but I figured it was worth mentioning.
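If you want to double-check which of the remaining columns are still text (and will therefore need encoding in the next step), something like this works:

# Columns with dtype 'object' are still strings and need to be
# encoded before they can go into the model.
print(df.dtypes)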

# Binary encoding of categorical variables

df['gender'] = df['gender'].map({'M': 0, 'F': 1})
df['Semester'] = df['Semester'].map({'F': 0, 'S': 1})
df['ParentschoolSatisfaction'] = df['ParentschoolSatisfaction'].map({'Good': 0, 'Bad': 1})
df['StudentAbsenceDays'] = df['StudentAbsenceDays'].map({'Under-7': 0, 'Above-7': 1})

df.head()

Screenshot of output.

Something that is perhaps not so obvious if you have never done this before is that you have to encode your categorical variables. It makes sense if you think about it: a model can’t really interpret ‘Good’ or ‘Bad’, but it can interpret 0 and 1.
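One caveat worth knowing about .map(): any value that isn’t in the dictionary silently becomes NaN. So before running the mappings above, it can pay to confirm the exact category strings, along these lines:

# Any category string not covered by the mapping dict becomes NaN,
# so confirm the exact spellings first.
for col in ['gender', 'Semester', 'ParentschoolSatisfaction', 'StudentAbsenceDays']:
    print(col, df[col].unique())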

# Check for missing values

df.isna().sum()

Screenshot of output.

I know I already said that we don’t have any missing values, but I just like to be thorough.

# Separate the target variable into its own series, then remove it from the original dataframe

labels = df['Class']
df.drop('Class', axis = 1, inplace = True)

And then —

df.head()
Screenshot of output.
labels.head()
Screenshot of output.

Next, we want to separate our target feature from our predictive features. We do this in order to create a train/test split for our data. Speaking of!

X_train, X_test, y_train, y_test = train_test_split(df, labels,
                                                    test_size=.25,
                                                    random_state=33)


This next part brings up two important points:

  1. You need to scale the data. If you don’t, variables with larger absolute values will be given more weight in the model for no real reason. We have features that are binary encoded (0, 1), but we also have a feature for how many times students raise their hands (0–80). We need to put them on the same scale so they carry the same importance in the model.
  2. You have to scale the data AFTER you perform the train/test split. If you don’t, you will have leakage and you will invalidate your model. For a more thorough explanation, check out the write-up on data leakage by Jason Brownlee, who has tons of amazing resources on machine learning.

The good news is, this is extremely easy to do.

scaler = StandardScaler()

scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.transform(X_test)

scaled_df_train = pd.DataFrame(scaled_data_train, columns=df.columns)

scaled_df_train.head()
Screenshot of output.

Awesome. Easy peasy lemon squeezy, our data is scaled.
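If you want proof, each training column should now have a mean of roughly 0 and a standard deviation of roughly 1. The test set will be close but not exact, since it was scaled with the training set’s statistics:

# Sanity check: StandardScaler should leave training columns with
# mean ~0 and standard deviation ~1.
print(scaled_df_train.mean().round(2))
print(scaled_df_train.std().round(2))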

# Instantiate the model
clf = KNeighborsClassifier()

# Fit the model
clf.fit(scaled_data_train, y_train)

# Predict on the test set
test_preds = clf.predict(scaled_data_test)

It really truly is that simple. Now, we want to see how well our baseline model performed.

def print_metrics(labels, preds):
    print("Precision Score: {}".format(precision_score(labels, preds, average='weighted')))
    print("Recall Score: {}".format(recall_score(labels, preds, average='weighted')))
    print("Accuracy Score: {}".format(accuracy_score(labels, preds)))
    print("F1 Score: {}".format(f1_score(labels, preds, average='weighted')))

print_metrics(y_test, test_preds)
Screenshot of output.

And there you have it, with almost no effort, we created a predictive model that is able to classify students into their academic performance class with an accuracy of 75.8%. Not bad.

We can probably improve this by at least a few points by tuning the parameters of the model, but I will leave that for another post.
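If you’re curious what that tuning might look like, here’s a minimal sketch using scikit-learn’s GridSearchCV. The parameter ranges are just illustrative guesses on my part, not tested values:

from sklearn.model_selection import GridSearchCV

# Illustrative grid: neighbor counts, weighting schemes, and the
# Minkowski power parameter (p=1 is Manhattan, p=2 is Euclidean).
param_grid = {
    'n_neighbors': list(range(1, 26)),
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(scaled_data_train, y_train)

print(grid.best_params_)
print(grid.best_score_)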

Happy learning. 😁