
A Peek Into Missing Data with Pandas | by Kurtis Pykes | Jul, 2020

Examples with Pandas

Note: For this task we will be using a dataset from Kaggle, specifically, the Categorical Feature Encoding Challenge. The code can be found in the link below.

We will start by reading in the data using pd.read_csv (see the pandas documentation).

# load and peek at the data
import pandas as pd

df = pd.read_csv("../data/categorical_feature_engineering_raw/train.csv")
df.head()
Figure 1: df.head(); the table is too large to fit into an image (and too wide to display all columns, hence the ellipses).

In Pandas, a Python library for data manipulation, missing values are represented as NaN or None, and there are multiple ways of checking whether we have any present in our data:

  • pd.isnull()
  • pd.notnull()
  • pd.isna()
  • pd.notna()
  • df.isna()
  • df.notna()
  • df.isnull()
  • df.notnull()

Yes, I know what you’re thinking: Which one should I use? Well, the biggest difference you’ll find between them is that 4 are top-level functions and the other 4 are methods of the pandas DataFrame class (pd.DataFrame.isna()). At the DataFrame level, df.isna() and df.isnull() are the same, likewise df.notna() and df.notnull() (they literally have the same documentation), and this pattern holds true for the top-level functions as well.
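A quick sanity check on a tiny toy frame (not the Kaggle data) confirms that all four pairs are aliases:

```python
import pandas as pd
import numpy as np

# a tiny toy frame (not the Kaggle data) to verify the aliases
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

# the method pairs are aliases and produce identical masks
assert toy.isna().equals(toy.isnull())
assert toy.notna().equals(toy.notnull())

# the top-level functions match the methods element for element
assert pd.isna(toy).equals(toy.isna())
assert pd.notna(toy).equals(toy.notna())
```

All four assertions pass, so the choice between them really is just a matter of taste.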

Why do multiple methods with different names do the same thing? Pandas is built on top of NumPy, which uses np.isnan() to detect missing values and therefore has neither NA nor NULL values. However, Pandas DataFrames are based on R data frames, and in R, NA and NULL are different things. Read more about R's NA and NULL here, and see the source of my answer here.

So I guess in that sense we have the option to make Pandas our own, a common theme in the Pandas framework.

I typically use df.isna() and its inverse df.notna(), mainly because they have fewer characters to type than df.isnull() and df.notnull(). Additionally, I prefer to have access straight to the DataFrame with dot notation, which rules out the top-level functions for me (they require you to pass the DataFrame or Series object as a parameter).

# see what .isna() returns
df.isna()
Figure 2: df.isna() returns a mask of Boolean values indicating, for each element in the DataFrame, whether that element is an NA value.

As a sanity check, I’ll do the same with the top-level function:

# same check via the top-level function
pd.isna(df)
Figure 3: pd.isna(df) returns a Boolean mask indicating whether each corresponding element is missing.

Okay, it’s the same (did you notice the few extra characters the top-level route requires? Yes, I am lazy). Now we can check whether df.notna() is truly the inverse of df.isna().

# see what .notna() returns
df.notna()
Figure 4: df.notna()

Even though the documentation for df.notna() reads confusingly similar to that of df.isna() (I personally find it confusing), df.notna() is indeed the element-wise inverse.
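We can prove the inverse relationship directly on a small toy Series (not the Kaggle data), since negating a Boolean mask with ~ flips every element:

```python
import pandas as pd
import numpy as np

toy = pd.Series([1.0, np.nan, 3.0, None])
# notna() is the element-wise negation of isna()
assert toy.notna().equals(~toy.isna())
```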

Beyond this point, let’s dig deeper into our data and count the missing values in each column.

# count missing values per column
df.isna().sum()
id 0
bin_0 0
bin_1 0
bin_2 0
bin_3 0
bin_4 0
nom_0 0
nom_1 0
nom_2 0
nom_3 0
nom_4 0
nom_5 0
nom_6 0
nom_7 0
nom_8 0
nom_9 0
ord_0 0
ord_1 0
ord_2 0
ord_3 0
ord_4 0
ord_5 0
day 0
month 0
target 0
dtype: int64

Oops, it looks like we have a dataset with no missing values. This is very unlikely to be the case in real-world scenarios, so we will leverage some skills that we learnt in PyTrix Series episode 4 to add some missing values (link below).

# inserting NaN values at regular row intervals
df.loc[2:1000:100, ["bin_0"]] = None
df.loc[:300:20, ["ord_4", "ord_5"]] = None
df.loc[500::50, ["nom_4", "bin_4"]] = None

# count missing values again
df.isna().sum()
id 0
bin_0 10
bin_1 0
bin_2 0
bin_3 0
bin_4 5990
nom_0 0
nom_1 0
nom_2 0
nom_3 0
nom_4 5990
nom_5 0
nom_6 0
nom_7 0
nom_8 0
nom_9 0
ord_0 0
ord_1 0
ord_2 0
ord_3 0
ord_4 16
ord_5 16
day 0
month 0
target 0
dtype: int64

Excellent! This means that when we use the inverse method, df.notna(), the count of non-null values in each column should equal the number of rows minus the number of missing values for that column, right?

# how many non-null values?
df.notna().sum()
id 300000
bin_0 299990
bin_1 300000
bin_2 300000
bin_3 300000
bin_4 294010
nom_0 300000
nom_1 300000
nom_2 300000
nom_3 300000
nom_4 294010
nom_5 300000
nom_6 300000
nom_7 300000
nom_8 300000
nom_9 300000
ord_0 300000
ord_1 300000
ord_2 300000
ord_3 300000
ord_4 299984
ord_5 299984
day 300000
month 300000
target 300000
dtype: int64
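The relationship above always holds: for any column, the missing and non-missing counts sum to the number of rows. A minimal sketch on a toy Series (not the Kaggle data):

```python
import pandas as pd
import numpy as np

toy = pd.Series([1.0, np.nan, 3.0, np.nan])
# missing + non-missing counts always sum to the number of rows
assert toy.isna().sum() == 2
assert toy.notna().sum() == len(toy) - toy.isna().sum()
```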


To see only the rows where a specific column is missing, we can pass a Boolean mask into the brackets ([]) after the DataFrame, like so…

# viewing only the rows where bin_4 is NaN
df[df["bin_4"].isna()]
Figure 5: Rows where bin_4 is a missing value
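The same Boolean-mask filtering can be sketched on a toy frame (column names here are made up for illustration):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"bin": [0.0, np.nan, 1.0], "tag": ["a", "b", "c"]})
# a Boolean mask inside [] keeps only the rows where the condition is True
missing_rows = toy[toy["bin"].isna()]
assert list(missing_rows.index) == [1]
assert missing_rows.shape == (1, 2)
```

Note that the filtered result keeps the original index labels, which is why the row numbers in Figure 5 are not consecutive.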

Without going too deep into how we can deal with missing values (that may be for another post), the next few lines of code show the easiest way, although not always the best way, to deal with missing values… drop them!

# original df shape
df.shape
>>>> (300000, 25)
# df shape after dropping rows with any missing values
df.dropna(how="any").shape
>>>> (293984, 25)
# df shape after dropping rows with all missing values
df.dropna(how="all").shape
>>>> (300000, 25)
# dropping any row if it has NaN in ord_4 or ord_5
df.dropna(subset=["ord_4", "ord_5"], how="any").shape
>>>> (299984, 25)
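The three dropna variants above can be sketched on a small toy frame (not the Kaggle data) so the row counts are easy to check by eye:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", None]})
# how="any" drops a row if any value in it is missing
assert toy.dropna(how="any").shape == (1, 2)
# how="all" drops a row only when every value is missing (none here)
assert toy.dropna(how="all").shape == (3, 2)
# subset restricts the check to the named columns
assert toy.dropna(subset=["y"]).shape == (2, 2)
```

This mirrors the full dataset: how="all" removed nothing because no row was entirely missing, while subset only counted rows missing in the listed columns.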

Some of you may have noticed that when I returned the rows where bin_4 has missing values in Figure 5, the bin_4 column showed None instead of NaN. Don’t worry if you did! Here is another example…

# ord_4 is an object column: missing entries display as None
df.loc[df["ord_4"].isna(), "ord_4"]
0 None
20 None
40 None
60 None
80 None
100 None
120 None
140 None
160 None
180 None
200 None
220 None
240 None
260 None
280 None
300 None
Name: ord_4, dtype: object
# bin_0 was upcast to float: missing entries display as NaN
df.loc[df["bin_0"].isna(), "bin_0"]
2 NaN
102 NaN
202 NaN
302 NaN
402 NaN
502 NaN
602 NaN
702 NaN
802 NaN
902 NaN
Name: bin_0, dtype: float64

Remember what I said about np.nan earlier? Good. It turns out that NaN is of data type float, so when missing values land in an integer column, the column's dtype is upcast to float and the missing entries display as NaN, whereas an object column simply stores the Python None object, as the example code above shows.
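A minimal sketch of that dtype behaviour, using freshly constructed toy Series rather than the Kaggle columns:

```python
import pandas as pd
import numpy as np

# NaN is a float, so a missing value forces an integer column to float64
ints = pd.Series([1, None, 3])
assert ints.dtype == "float64"
assert np.isnan(ints.iloc[1])

# object columns simply hold the Python None object as-is
objs = pd.Series(["a", None, "c"])
assert objs.dtype == "object"
assert objs.iloc[1] is None
```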