## Examples with Pandas

Note: for this task we will be using a dataset from Kaggle, specifically the Categorical Feature Encoding Challenge. The code can be found in the link below.

We will start by reading in the data using `pd.read_csv` (Documentation).

```python
import pandas as pd

# load and peek at data
df = pd.read_csv("../data/categorical_feature_engineering_raw/train.csv")
df.head()
```

In Pandas, a Python framework for data manipulation, missing values are represented as `NaN` or `None`, and there are multiple ways of checking whether we have any present in our data:

- `pd.isnull()`
- `pd.notnull()`
- `pd.isna()`
- `pd.notna()`
- `df.isna()`
- `df.notna()`
- `df.isnull()`
- `df.notnull()`

Yes, I know what you’re thinking: *which one should I use?* Well, the biggest difference you’ll find between them is that four are top-level functions and the other four are methods of the pandas DataFrame class (`pd.DataFrame.isna()`). Inherently, at the DataFrame level `df.isna()` and `df.isnull()` are the same, likewise with `df.notna()` and `df.notnull()` (they literally have the same documentation), and the same holds true for the top-level functions.
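To see that the names really are aliases, here is a quick sketch on a toy frame (the data is made up for illustration, not the Kaggle set):

```python
import pandas as pd
import numpy as np

# toy frame standing in for the real data (assumed for illustration)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

# the method pairs are aliases: they produce identical boolean frames
same_na = toy.isna().equals(toy.isnull())
same_notna = toy.notna().equals(toy.notnull())
print(same_na, same_notna)  # True True
```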

Why have multiple methods with different names do the same thing? Pandas is built on top of NumPy, which uses `np.isnan()` to detect missing values, ergo it has neither **Null** nor **Na** values. However, Pandas uses DataFrames that are based on R DataFrames, and in R **Null** and **Na** are different things. Read more about R **Null** and **Na** here, and see the source of my answer here.

So I guess in that sense we have the option to make Pandas our own, a common theme in the Pandas framework.

I typically use `df.isna()` and the inverse `df.notna()`, mainly because they have fewer characters to type than `df.isnull()` and `df.notnull()`. Additionally, I prefer to have access straight from the DataFrame with dot notation, which rules out the top-level functions for me (they require you to pass the DataFrame or Series object as a parameter).

```python
# see what .isna() returns
df.isna()
```

As a sanity check, I’ll do the same with `pd.isna()`…

```python
pd.isna(df)
```

Okay, it’s the same (did you notice you had to type a few extra letters for the top-level function route? Yes, I am lazy). Now we can check whether `df.notna()` is truly the inverse of `df.isna()`…

```python
# see what .notna() returns
df.notna()
```

Even though the documentation for `df.notna()` reads almost word for word like that of `df.isna()` (which I personally find confusing), `df.notna()` is indeed the inverse.
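We can confirm the inverse relationship directly: negating the `isna()` mask with `~` should reproduce `notna()`. A minimal sketch on made-up data:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"a": [1.0, np.nan], "b": [None, "x"]})

# notna() is the elementwise negation (~) of isna()
inverse_ok = toy.notna().equals(~toy.isna())
print(inverse_ok)  # True
```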

Moving on, let’s dig further into our data to check for missing values.

```python
df.isna().sum()

>>>>
id        0
bin_0     0
bin_1     0
bin_2     0
bin_3     0
bin_4     0
nom_0     0
nom_1     0
nom_2     0
nom_3     0
nom_4     0
nom_5     0
nom_6     0
nom_7     0
nom_8     0
nom_9     0
ord_0     0
ord_1     0
ord_2     0
ord_3     0
ord_4     0
ord_5     0
day       0
month     0
target    0
dtype: int64
```

Oops, it looks like we have a dataset with no missing values. This is very unlikely to be the case in real-world scenarios, so we will leverage some skills that we learnt in PyTrix Series episode 4 to add some missing values (link below).

```python
# inserting NaNs randomly into the data
df.loc[2:1000:100, ["bin_0"]] = None
df.loc[:300:20, ["ord_4", "ord_5"]] = None
df.loc[500::50, ["nom_4", "bin_4"]] = None

df.isna().sum()

>>>>
id           0
bin_0       10
bin_1        0
bin_2        0
bin_3        0
bin_4     5990
nom_0        0
nom_1        0
nom_2        0
nom_3        0
nom_4     5990
nom_5        0
nom_6        0
nom_7        0
nom_8        0
nom_9        0
ord_0        0
ord_1        0
ord_2        0
ord_3        0
ord_4       16
ord_5       16
day          0
month        0
target       0
dtype: int64
```

Excellent! This means that when we use the inverse method, `df.notna()`, the number of missing values should be deducted from the number of rows for that specific column, right?

```python
# how many non-null values?
df.notna().sum()

>>>>
id        300000
bin_0     299990
bin_1     300000
bin_2     300000
bin_3     300000
bin_4     294010
nom_0     300000
nom_1     300000
nom_2     300000
nom_3     300000
nom_4     294010
nom_5     300000
nom_6     300000
nom_7     300000
nom_8     300000
nom_9     300000
ord_0     300000
ord_1     300000
ord_2     300000
ord_3     300000
ord_4     299984
ord_5     299984
day       300000
month     300000
target    300000
dtype: int64
```

Exactly!
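In other words, per column the missing and non-missing counts always sum to the number of rows. A quick check on a toy frame (made-up data, not the Kaggle set):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": ["x", "y", None, "z"]})

# per column, missing + non-missing counts sum to the row count
counts = toy.isna().sum() + toy.notna().sum()
print((counts == len(toy)).all())  # True
```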

To see only the rows with a specific missing observation, we can pass a conditional statement into the brackets (`[]`) after calling the DataFrame, like so…

```python
# viewing only the rows where bin_4 = NaN
df[df["bin_4"].isna()]
```
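These boolean masks can also be combined with `|` (or) and `&` (and), with each condition wrapped in parentheses. A sketch on made-up data (the column names mimic the Kaggle set but the values are invented):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"bin_4": [0.0, np.nan, 1.0, np.nan],
                    "ord_4": ["a", "b", None, "d"]})

# rows where bin_4 is missing
missing_bin4 = toy[toy["bin_4"].isna()]

# rows where either bin_4 or ord_4 is missing
either = toy[toy["bin_4"].isna() | toy["ord_4"].isna()]
print(len(missing_bin4), len(either))  # 2 3
```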

Without going too deep into how we can deal with missing values (that may be for another post), the next few lines of code will show the easiest way **(although not always the best way)** to deal with missing values… drop them!

```python
# original df shape
df.shape
>>>> (300000, 25)

# df shape after dropping rows with any missing values
df.dropna(how="any").shape
>>>> (293984, 25)

# df shape after dropping rows where all values are missing
df.dropna(how="all").shape
>>>> (300000, 25)

# dropping any row if it has NaN in ord_4 or ord_5
df.dropna(subset=["ord_4", "ord_5"], how="any").shape
>>>> (299984, 25)
```
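`dropna` also takes a `thresh` parameter, which keeps only the rows that hold at least that many non-missing values; a middle ground between `how="any"` and `how="all"`. A minimal sketch on made-up data:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1.0, 2.0, np.nan],
})

# keep only rows with at least 2 non-missing values
kept = toy.dropna(thresh=2)
print(kept.shape)  # (2, 3)
```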

Some of you may have noticed that when I returned the rows for which bin_4 has missing values in *Figure 5*, the bin_4 column returned `None` values instead of `NaN`. Don’t worry if you did! Here is another example…

```python
df[df["ord_4"].isna()]["ord_4"]

>>>>
0      None
20     None
40     None
60     None
80     None
100    None
120    None
140    None
160    None
180    None
200    None
220    None
240    None
260    None
280    None
300    None
Name: ord_4, dtype: object
```

```python
df[df["bin_0"].isna()]["bin_0"]

>>>>
2      NaN
102    NaN
202    NaN
302    NaN
402    NaN
502    NaN
602    NaN
702    NaN
802    NaN
902    NaN
Name: bin_0, dtype: float64
```

Remember what I said about `np.nan` earlier? Good. It turns out that `NaN` is of data type float, therefore a missing value in an object-dtype column shows up as `None`, while a missing value in an int column converts the dtype to float, as the example code above shows.