Load data set and quick data exploration
For simplicity, we will use the Iris data set, which can be loaded from the scikit-learn library using the following code:
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['species'] = data['target']
This gives us a data set with only five columns: four feature columns and the species column. Let’s call the info() function on the data frame for a quick analysis:
The info() output shows that there are only 150 entries and no missing values in any of the columns.
Additionally, we have learnt that the first four columns hold float values, whereas the last column holds integers only. In fact, from the data set description we know that the species column takes only three values, each representing one type of flower.
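For reference, the info() check described above can be sketched as follows. The variable names n_rows, n_cols and missing_per_column are illustrative and not part of the original code, and the exact layout of the info() printout may vary between pandas versions.

```python
from sklearn.datasets import load_iris
import pandas as pd

# Rebuild the data frame exactly as in the loading step.
data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['species'] = data['target']

# Print column names, non-null counts, and dtypes in one summary.
df.info()

# The same facts, checked programmatically (illustrative names).
n_rows, n_cols = df.shape
missing_per_column = df.isna().sum()
print(n_rows, n_cols)
print(missing_per_column)
```

Checking df.shape and df.isna().sum() directly is handy when you want to assert these properties in a script instead of reading them off the printed summary.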
To confirm this, you can call the unique() function on that column:
array([0, 1, 2])
Indeed, the species column takes only three values: 0, 1, and 2.
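A minimal sketch of that check is shown below. It also maps each integer code to a flower name using the target_names entry of the loaded data set; the names codes and code_to_name are illustrative additions, not part of the original walkthrough.

```python
from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['species'] = data['target']

# Distinct integer codes stored in the species column.
codes = df['species'].unique()
print(codes)

# Map each code to the flower name shipped with the data set.
code_to_name = {i: str(name) for i, name in enumerate(data['target_names'])}
print(code_to_name)
```

Keeping the integer codes in the data frame and looking up the names only when needed leaves the species column numeric.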
Knowing this basic information about our data set, we can proceed to visualizations. Note that if there were missing values in any of the columns, you should either drop the affected rows or fill the values in, because some of the techniques we will discuss later do not allow missing values.
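The Iris data set has no gaps, so the sketch below injects one artificial missing value into a copy of the data frame purely for illustration. It then shows the two options mentioned above: dropping rows with dropna(), and filling values with fillna(). Filling with the column mean is an assumption made for this example; the right strategy depends on your data.

```python
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['species'] = data['target']

# Illustration only: create a copy with one artificial missing value.
df_missing = df.copy()
first_feature = data['feature_names'][0]
df_missing.loc[0, first_feature] = np.nan

# Option 1: drop every row that contains at least one missing value.
dropped = df_missing.dropna()

# Option 2: fill each missing value with the mean of its column
# (mean imputation is an assumption chosen for this sketch).
filled = df_missing.fillna(df_missing.mean(numeric_only=True))

print(len(dropped), len(filled))
```

Dropping is the simpler choice when only a few rows are affected, while filling keeps all 150 rows at the cost of introducing estimated values.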