
5 Steps to Create a Basic Machine Learning Model using Python

In this article, we will explore Udemy Class data from Kaggle.com and try and predict which classes are successful using Pandas, Matplotlib, Seaborn, and Scikit-learn.

Andrew Hong

The data set can be found here, and the code is found at this GitHub repo.

To start, we will import the following packages and read in the data set:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# File name assumed from the Kaggle Udemy Courses data set; adjust the path to wherever you saved it
df = pd.read_csv("udemy_courses.csv")

We’ll run df.info() to peek at our data set before we start our analysis. We can see we have 3678 courses, and luckily have no missing data in the columns!

RangeIndex: 3678 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   course_id            3678 non-null   int64
 1   course_title         3678 non-null   object
 2   url                  3678 non-null   object
 3   is_paid              3678 non-null   bool
 4   price                3678 non-null   int64
 5   num_subscribers      3678 non-null   int64
 6   num_reviews          3678 non-null   int64
 7   num_lectures         3678 non-null   int64
 8   level                3678 non-null   object
 9   content_duration     3678 non-null   float64
 10  published_timestamp  3678 non-null   object
 11  subject              3678 non-null   object
dtypes: bool(1), float64(1), int64(5), object(5)
memory usage: 319.8+ KB

We can see ‘published_timestamp’ is stored as an object, so let’s convert it to a datetime variable before moving on.

df["published_timestamp"] = pd.to_datetime(df["published_timestamp"])

Here are the steps we will take together:

  1. Distribution Analysis: What are the common subjects, duration, and price points of courses?
  2. Combining Features: Can we combine any columns for more interesting features?
  3. Comparing Features with Target Variable: Can we get any initial insights on what leads to a successful course?
  4. Building a Model: Can we predict which courses are the most successful?
  5. Improving and Deploying the Model: What are the next steps we can take to build a better model and make it easy for others to access?

Distribution Analysis:
One common way to start is using df.hist() on numerical data, and then plotting bar charts after applying value_counts() on categorical data. We can separate numerical and categorical data using df.select_dtypes(['type']).

num_cols_df = df.select_dtypes(['int64','float64','datetime64[ns, UTC]'])
cat_cols_df = df.select_dtypes(['object'])
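
As a quick sketch of the df.hist() step (the figure size is just a readability choice), we can plot every numeric distribution at once:

# Histograms of all numeric columns; pandas skips non-numeric columns automatically
df.hist(figsize=(12, 8))
plt.tight_layout()
plt.show()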

Most of these features are exponentially distributed, except for price.

For categorical variables we plot the relative frequency using the following format:

(cat_cols_df[column_name].value_counts()/cat_cols_df.shape[0]).plot(kind="bar")

Replacing ‘column_name’ with ‘level’ gives us the following chart:

Level Relative Frequency

and again with ‘subject’:

Subject Relative Frequency

We may also want to look at sub-distributions, such as levels within a subject. To do this, we filter down to just the subject we want by adding a boolean mask, [cat_cols_df['subject']=='Business Finance'], before selecting ['level'], and run pretty much the same line of code:

(cat_cols_df[cat_cols_df['subject']=='Business Finance']['level'].value_counts()/cat_cols_df.shape[0]).plot(kind="bar")
‘Levels’ relative frequency within ‘Business Finance’

Combining Features:

I usually begin by looking for any relative values I can create. In this case, we can divide ‘content_duration’ by ‘num_lectures’ to figure out average length of lecture in hours.

num_cols_df["average_lecture_length"] = num_cols_df["content_duration"]/num_cols_df["num_lectures"]

Plotting a histogram tells us most lectures are about 0.1 of an hour, or 6 minutes long.

Average_Lecture_Length (Hours)

While we might immediately say the number of subscribers makes a successful class, this doesn’t take into account that some classes are free or very cheap. For a more accurate gauge of success, we should look at revenue, or ‘price’ * ‘num_subscribers’.

num_cols_df["revenue"] = num_cols_df["price"]*num_cols_df["num_subscribers"]

Plotting a histogram shows a strongly exponential distribution, and taking num_cols_df["revenue"].mean() gives us an average of about $250,000 in revenue.

Revenue ($ Hundred Thousands)

Comparing Features with Target Variable:

To compare against our target variable ‘revenue’, we can use the Seaborn library. This can help us figure out what features to include in the model (or remove later).

temp_df = pd.concat([num_cols_df,cat_cols_df], axis = 1)
sns.pairplot(temp_df, x_vars = ['content_duration','num_lectures','num_reviews','average_lecture_length'],y_vars = ['num_subscribers','price','revenue'], hue = 'subject')

The ‘hue’ argument in the pairplot() allows us to overlay categorical data on the chart. ‘num_reviews’ looks like it has a strong correlation with the y_vars, and it seems like ‘Business Finance’ subjects have a larger spread in ‘average_lecture_length’ but not much variation in any y_vars. ‘Web Development’ is much tighter in ‘average_lecture_length’, but has wide variation in y_vars. It’s possible that our combined feature ‘average_lecture_length’ won’t be a good predictor in the model.

‘Revenue’ does seem to show cleaner correlations than ‘price’ or ‘num_subscribers’, so we will stick with that as our target variable. Also from a business perspective, revenue should be more important.

Below is the same pairplot but with ‘level’ as ‘hue’ instead. ‘All Levels’ has a much larger spread across all plots than the others; this may be interesting to look into later.

Building a Model:

We start our data pipeline by preparing the numerical and categorical columns from earlier separately. If we had any null values, we could either drop them or impute them by applying one of the following lambda functions:

fill_mean = lambda col: col.fillna(col.mean())
fill_mode = lambda col: col.fillna(col.mode()[0])
fill_median = lambda col: col.fillna(col.median())
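
They aren’t needed for this data set, but as a usage sketch, applying one of them to a single column would look like:

# Hypothetical example: replace any missing prices with the median price
df["price"] = fill_median(df["price"])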

For ‘cat_cols_df’ we create dummy variable columns using pd.get_dummies(). We’ll first remove ‘course_title’ and ‘url’, as those are unlikely to be useful at the moment, then create dummy columns for each remaining categorical column.

cat_cols_df = cat_cols_df.iloc[:,2:]
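
With those dropped, a minimal version of the dummy-variable step is a one-liner (it replaces ‘level’ and ‘subject’ with one 0/1 column per category):

cat_cols_df = pd.get_dummies(cat_cols_df)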

As a note, most production pipelines use Scikit-learn’s OneHotEncoder instead, as it can handle categories that were unseen during training and can output sparse matrices. Say your training data had a column “colors” with only ‘red’ and ‘green’ as values, but the new data has an extra value of ‘blue’. get_dummies() will create a new column for ‘blue’ even though it wasn’t in the trained model, which leads to a feature mismatch and errors.
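
Here is a small sketch of that behavior with a hypothetical ‘colors’ column, not our actual data:

from sklearn.preprocessing import OneHotEncoder

# Toy data: 'blue' only appears in the new data, never in training
train_colors = pd.DataFrame({"colors": ["red", "green", "red"]})
new_colors = pd.DataFrame({"colors": ["green", "blue"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_colors)
# The unseen 'blue' row becomes all zeros instead of changing the feature layout
print(encoder.transform(new_colors).toarray())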

After creating dummy variables for our categorical columns, we split the data into training and test sets randomly using Scikit-learn’s ‘train_test_split()’ method (‘random_state’ fixes the random split so the results can be reproduced by anyone).

For X, we create ‘X_num_cols_df’ and remove ‘course_id’, as it is likely just noise. ‘revenue’ and ‘num_subscribers’ are also removed, since ‘revenue’ is what we are trying to predict and ‘num_subscribers’ feeds directly into it.

from sklearn.model_selection import train_test_split
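
Below is a sketch of that split. The exact columns kept and the 70/30 split are assumptions (the 1,104 test rows reported later imply test_size=0.3, and the random_state value is arbitrary), and ‘published_timestamp’ is dropped here since LinearRegression needs numeric inputs:

# Features: numeric columns minus the id, the targets, and the timestamp, plus the dummy columns
X_num_cols_df = num_cols_df.drop(columns=["course_id", "revenue", "num_subscribers", "published_timestamp"])
X = pd.concat([X_num_cols_df, cat_cols_df], axis=1)
y = num_cols_df["revenue"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)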

Next, we create a model to fit the data to using Scikit-learn; in this case, we’re using the LinearRegression model.

from sklearn.linear_model import LinearRegression
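
Fitting is then just two lines (variable names carried over from the split sketch above):

lm = LinearRegression()
lm.fit(X_train, y_train)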

After fitting the model, we predict revenue for the X_test data and look at the mean squared error (MSE), a commonly used scoring metric.

from sklearn.metrics import mean_squared_error
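
A sketch of that scoring step, printing in the same format as the output shown below:

y_preds = lm.predict(X_test)
mse = mean_squared_error(y_test, y_preds)
print(f"The MSE for your model was {mse} on {len(y_test)} values.")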

This outputs

The MSE for your model was 312248327184.6464 on 1104 values.

Improving and Deploying the Model:
Our model is clearly terrible, so let’s try to understand why. More often than not, your first model will be really far off. That’s okay: a lot of the real work comes from choosing subsets of features, cross-validating, trying ensemble methods, or using different models!

However, before that, it’s very important to learn about the domain of your data as best you can; there are likely events and trends not captured in the data that may be ruining your model. Udemy often cuts the price of all of its courses to $10, regardless of the initial price, so it could be worth removing ‘price’ unless we can find the average sale price of a class. I would also recommend digging more into the third step, Comparing Features with Target Variable, as understanding the patterns better will drive more insight than just tweaking the model itself.

Another way to improve the model could be to apply natural language processing (NLP) on the ‘course_title’ column to see if the presence of certain keywords has an effect.
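
One hedged way to sketch that idea is with Scikit-learn’s CountVectorizer, turning titles into keyword-count columns (the 50-keyword cap is an arbitrary choice):

from sklearn.feature_extraction.text import CountVectorizer

# Count the most common English words across course titles
vectorizer = CountVectorizer(max_features=50, stop_words="english")
title_counts = vectorizer.fit_transform(df["course_title"])
title_df = pd.DataFrame(title_counts.toarray(), columns=vectorizer.get_feature_names_out(), index=df.index)
# title_df could then be joined onto X before refitting the model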

The easiest way to share this model would be to put all your code in a Jupyter notebook and publish it with Binder. If you want a cleaner UI, you can create a Plotly Dash app and deploy it on Heroku.

Best of luck and happy coding!