Examples with Pandas
Note: For this task we will be using a dataset from Kaggle: the Categorical Feature Encoding Challenge. The code can be found in the link below.
We will start by reading in the data using pd.read_csv (Documentation).
import pandas as pd # load and peek at data
df = pd.read_csv("../data/categorical_feature_engineering_raw/train.csv")
In Pandas, a Python framework for data manipulation, missing values are represented as NaN or None, and there are multiple ways of checking whether we have any present in our data: pd.isna(), pd.isnull(), pd.notna(), pd.notnull(), and their DataFrame counterparts df.isna(), df.isnull(), df.notna() and df.notnull().
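Concretely, the checks come in pairs. Here is a minimal sketch on a toy frame (hypothetical values, not the Kaggle data) showing both routes side by side:

```python
import pandas as pd

# toy stand-in for the competition data
toy = pd.DataFrame({"bin_0": [0.0, None, 1.0], "ord_4": ["A", "B", None]})

# top-level functions: you pass the object in as a parameter
print(pd.isna(toy))
print(pd.notna(toy))

# DataFrame methods: same results, called straight off the frame
print(toy.isna())
print(toy.isnull())   # alias of .isna()
print(toy.notna())
print(toy.notnull())  # alias of .notna()
```

Either route returns a boolean DataFrame of the same shape as the input.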
Yes, I know what you’re thinking: which one should I use? Well, the biggest difference you’ll find between them is that four are top-level functions and the other four are methods of the pandas DataFrame class (e.g. pd.DataFrame.isna()). At the DataFrame level, df.isna() and df.isnull() are the same, likewise df.notna() and df.notnull() (they literally have the same documentation) — and this holds true for the top-level functions as well.
Why have multiple methods with different names that do the same thing? Pandas is built on top of NumPy, which uses np.isnan() to detect missing values, ergo NumPy has neither NA nor Null values. However, Pandas DataFrames are based on R DataFrames, and in R, NA and Null are different things — read more about R NA and Null here and see the source of my answer here.
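To see NumPy’s side of this for yourself, here is a small sketch (toy values, not the Kaggle data) showing that np.isnan() only understands float NaN and has no notion of None:

```python
import numpy as np

# np.isnan() is a float-level check
print(np.isnan(np.nan))  # True
print(np.isnan(1.0))     # False

# it has no concept of None: an object array raises a TypeError
try:
    np.isnan(np.array([1.0, None], dtype=object))
except TypeError as err:
    print("TypeError:", err)
```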
So I guess in that sense we have the option to make Pandas our own, a common theme in the Pandas framework.
I typically use df.isna() and its inverse df.notna(), mainly because they have fewer characters to type than df.isnull() and df.notnull(). Additionally, I prefer to have access straight to the DataFrame with dot notation, which rules out the top-level functions for me (they require you to pass the DataFrame or Series object as a parameter).
# see what .isna() returns
df.isna()
As a sanity check, I’ll do the same with the top-level function:
pd.isna(df)
Okay, it’s the same (did you notice you had to type a few extra letters for the top-level function route? Yes, I am lazy). Now we can check if df.notna() is truly the inverse of df.isna().
# see what .notna() returns
df.notna()
Even though the documentation for df.notna() describes its return value in exactly the same terms as df.isna() (which I personally find confusing), df.notna() is indeed the inverse.
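A quick way to convince yourself of the inverse relationship — on a toy frame rather than the Kaggle data — is to compare the two masks element-wise:

```python
import pandas as pd

# toy stand-in for the competition data
demo = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})

# .notna() should equal the element-wise negation of .isna()
print(demo.notna().equals(~demo.isna()))  # True
```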
With that settled, let’s dig further into our data to check for missing values with df.isna().sum().
Oooops, looks like we have a dataset that has no missing values. This is very unlikely to be the case in real-world scenarios, so we will leverage some skills that we learnt in PyTrix Series episode 4 to add some missing values (link below).
# inserting NaN's randomly into the data
df.loc[2:1000:100, ["bin_0"]] = None
df.loc[:300:20, ["ord_4", "ord_5"]] = None
df.loc[500::50, ["nom_4", "bin_4"]] = None
df.isna().sum()
>>>>
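The .loc slices above read as start:stop:step over the index labels; note that, unlike plain Python slicing, .loc includes the stop label. A minimal sketch on a tiny frame (hypothetical data, not the real columns):

```python
import pandas as pd
import numpy as np

small = pd.DataFrame({"bin_0": np.arange(10.0)})

# labels 2, 5 and 8 get a missing value (no stop, step of 3)
small.loc[2::3, ["bin_0"]] = None
print(small["bin_0"].isna().sum())  # 3

# .loc slicing is inclusive of the stop label: rows 0, 2 and 4
inclusive = pd.DataFrame({"x": np.arange(6.0)})
inclusive.loc[:4:2, ["x"]] = None
print(inclusive["x"].isna().sum())  # 3
```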
Excellent! This means that when we use the inverse method, df.notna(), the number of missing values should be deducted from the number of rows for each column, right?
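That arithmetic is easy to verify on a toy column (hypothetical values, not the competition data):

```python
import pandas as pd

frame = pd.DataFrame({"col": [1.0, None, 3.0, None]})

# non-null count per column equals total rows minus the null count
print(frame.notna().sum())              # col: 2
print(len(frame) - frame.isna().sum())  # col: 2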
# how many non-null values?
df.notna().sum()
To see only the rows with a specific missing observation, we can pass a conditional statement into the brackets ([]) after calling the dataframe like so…
# viewing only the rows where bin_4 = NaN
df[df["bin_4"].isna()]
Without going too deep into how we can deal with missing values — this may be for another post — the next few lines of code will show the easiest way (although not always the best way) to deal with missing values… drop them!
# original df shape
df.shape
>>>> (300000, 25)
# df shape after dropping rows with any missing values
df.dropna(how="any").shape
>>>> (293984, 25)
# df shape after dropping rows with all missing values
df.dropna(how="all").shape
>>>> (300000, 25)
# dropping any row if it has NaN in ord_4 or ord_5
df.dropna(subset=["ord_4", "ord_5"], how="any").shape
>>>> (299984, 25)
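To make the how="any" / how="all" / subset behaviour concrete on something you can run without the Kaggle file, here is a three-row sketch:

```python
import pandas as pd
import numpy as np

mini = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
})

print(mini.dropna(how="any").shape)                # (1, 2) only the fully populated row survives
print(mini.dropna(how="all").shape)                # (2, 2) only the all-NaN row is dropped
print(mini.dropna(subset=["b"], how="any").shape)  # (2, 2) drops rows where b is NaN
```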
Some of you may have noticed that when I returned the rows for which bin_4 has missing values in Figure 5, the bin_4 column returned None values instead of NaN — don’t worry if you did! Here is another example…
df[df["ord_4"].isna()]["ord_4"]
>>>>
Name: ord_4, dtype: object
df[df["bin_0"].isna()]["bin_0"]
>>>>
Name: bin_0, dtype: float64
Remember what I said about NumPy’s NaN earlier? Good. It turns out that NaN is of data type float; therefore, in an object-dtype column the missing value is stored as None, while in an int column the missing value becomes NaN and the column’s dtype is upcast to float, as the example code above shows.
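You can watch the dtype behaviour directly with two throwaway Series (toy data, not the competition columns):

```python
import pandas as pd

# a missing value forces an otherwise-integer column to float64
ints = pd.Series([1, 2, None])
print(ints.dtype)                  # float64

# without the missing value it would have stayed int64
print(pd.Series([1, 2, 3]).dtype)  # int64

# in an object (string) column the None is stored as-is
strings = pd.Series(["A", None, "C"])
print(strings.dtype)               # object
print(strings[1])                  # None
```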