5 Powerful Visualisation with Pandas for Data Preprocessing | by Kaushik Choudhury | Aug, 2020

Autocorrelation plot

Autocorrelation plots are a quick litmus test for whether the data points are random. If the data points follow a certain trend, one or more of the autocorrelations will be significantly non-zero. The dashed line in the plot marks the 99% confidence band, and the solid line marks the 95% band.

In the code below, we check whether the total_bill amounts in the “tips” dataset are random.

autocorrelation_plot(MealDatabase.total_bill)
plt.show()

We can see that the autocorrelation stays very close to zero for all time lags, suggesting that the total_bill data points are random.

Plotted with the code mentioned in the article

When we draw the autocorrelation plot for data points that follow a particular order, we can see that the autocorrelations are significantly non-zero.

data = pd.Series(np.arange(12,7000,16.3))
autocorrelation_plot(data)
plt.show()
Plotted with the code mentioned in the article
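
To make the plot less of a black box, here is a minimal sketch of the quantity it draws: the autocorrelation at a given lag, compared against the 99% band. The helper names autocorrelation and confidence_band_99 are my own, not pandas functions, and the band of 2.576/sqrt(n) is the usual normal approximation, which I am assuming matches the dashed line in the plot.

```python
import numpy as np


def autocorrelation(values, lag):
    """Sample autocorrelation of a 1-D sequence at the given lag (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean = x.mean()
    # Variance term (lag 0) and covariance between the series and its lagged copy
    c0 = np.sum((x - mean) ** 2) / n
    ch = np.sum((x[: n - lag] - mean) * (x[lag:] - mean)) / n
    return ch / c0


def confidence_band_99(n):
    """Half-width of the 99% band under a normal approximation (assumption)."""
    return 2.5758 / np.sqrt(n)


# Same ordered series as in the article, plus random noise of the same length
trend = np.arange(12, 7000, 16.3)
rng = np.random.default_rng(0)
noise = rng.normal(size=len(trend))

band = confidence_band_99(len(trend))
trend_r1 = autocorrelation(trend, 1)
noise_r1 = autocorrelation(noise, 1)

print("99% band:", band)
print("lag-1 autocorrelation, ordered series:", trend_r1)
print("lag-1 autocorrelation, random noise:", noise_r1)
```

The ordered series lands far outside the band at lag 1, which is what "significantly non-zero" means in the plot above.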

Lag Plots

Lag plots are also helpful to verify whether a dataset is a random set of values or follows a certain trend. Each point in a lag plot pairs a value with the value that comes right after it in the series.

When the lag plot of the total_bill values from the “tips” dataset is drawn, it agrees with the autocorrelation plot: the points are scattered all over the place, suggesting random data.

lag_plot(MealDatabase.total_bill)
plt.show()
Plotted with the code mentioned in the article

When we draw the lag plot of a non-random data series, as shown in the code below, the points fall on a nice smooth line.

data = pd.Series(np.arange(-12*np.pi,300*np.pi,10))
lag_plot(data)
plt.show()
Plotted with the code mentioned in the article
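
The smooth line is easy to explain once the pairs behind a lag plot are written out. The sketch below builds those pairs by hand for the ordered series from the article and for a shuffled copy of it. The lag_pairs helper is a name I made up for illustration, and shuffling is just my stand-in for "random" data.

```python
import numpy as np


def lag_pairs(values, lag=1):
    """Return (y(t), y(t + lag)), the x and y coordinates of a lag plot (hypothetical helper)."""
    y = np.asarray(values, dtype=float)
    return y[:-lag], y[lag:]


# Ordered series from the article and a randomly shuffled copy of it
ordered = np.arange(-12 * np.pi, 300 * np.pi, 10)
rng = np.random.default_rng(0)
shuffled = rng.permutation(ordered)

x_ord, y_ord = lag_pairs(ordered)
x_shuf, y_shuf = lag_pairs(shuffled)

# For the ordered series every point satisfies y(t + 1) = y(t) + 10,
# so all points sit on one straight line
ordered_corr = np.corrcoef(x_ord, y_ord)[0, 1]
shuffled_corr = np.corrcoef(x_shuf, y_shuf)[0, 1]

print("correlation of pairs, ordered series:", ordered_corr)
print("correlation of pairs, shuffled series:", shuffled_corr)
```

Plotting x_ord against y_ord with plt.scatter should give the same picture as lag_plot(data), since the default lag is 1.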

Parallel coordinates

It is always a challenge to wrap our heads around and visualize data with more than three dimensions. Parallel coordinates are very useful for plotting such higher-dimensional datasets. Each dimension is represented by a vertical line.

In parallel coordinates, “N” equally spaced vertical lines represent the “N” dimensions of the dataset. Each data point is drawn as a line whose vertex on the n-th axis sits at the n-th coordinate of the point.

Confusing!

Let us consider a small sample data with five features for small and large size widgets.

A vertical line represents each feature of the widget. A continuous series of line segments represents the feature values of each “small” and “large” widget.

Plotted with the code mentioned in the article
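
The widget figure can be approximated with a few lines. This is only a sketch: the five feature names and all the numbers are invented for illustration, since the values behind the figure are not given here.

```python
import pandas as pd
from pandas.plotting import parallel_coordinates

# Hypothetical feature names and values, made up for this example
features = ["length", "width", "height", "weight", "price"]
widgets = pd.DataFrame(
    [
        [2.0, 1.5, 1.0, 3.0, 4.0],  # a "small" widget
        [6.0, 5.0, 4.5, 8.0, 9.0],  # a "large" widget
    ],
    columns=features,
)
widgets["size"] = ["small", "large"]

# One vertical axis per feature, one polyline per widget, coloured by "size"
ax = parallel_coordinates(widgets, "size", color=("#556270", "#C7F464"))
```

Call plt.show() from matplotlib.pyplot afterwards to display the figure, as in the other snippets.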

The code below plots the parallel coordinates for the “attention” dataset in seaborn. Note that points that tend to cluster appear closer together.

parallel_coordinates(AttentionDatabase,"attention",color=('#556270', '#C7F464'))
plt.show()
Plotted with the code mentioned in the article

I hope you will start using these out-of-the-box plots for exploratory data analysis if you are not already using them. I would love to hear about your favourite visualization plots for EDA.

If you would like to learn a structured approach to identifying the appropriate independent variables for making accurate predictions, read my article “How to identify the right independent variables for Machine Learning Supervised.”

"""Full code"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix
from pandas.plotting import autocorrelation_plot
from pandas.plotting import parallel_coordinates
from pandas.plotting import lag_plot

# Load the sample datasets that ship with seaborn
CarDatabase = sns.load_dataset("mpg")
MealDatabase = sns.load_dataset("tips")
AttentionDatabase = sns.load_dataset("attention")

# Scatter plot
plt.scatter(CarDatabase.acceleration, CarDatabase.horsepower, marker="^")
plt.show()

# Hexbin plot
CarDatabase.plot.hexbin(x='acceleration', y='horsepower', gridsize=10, cmap="YlGnBu")
plt.show()

# Correlation heatmap; keep only numeric columns, because recent pandas
# versions raise an error when corr() meets text columns
sns.heatmap(CarDatabase.select_dtypes("number").corr(), annot=True, cmap="YlGnBu")
plt.show()

# Autocorrelation plot: random data
autocorrelation_plot(MealDatabase.total_bill)
plt.show()

# Autocorrelation plot: ordered data
data = pd.Series(np.arange(12, 7000, 16.3))
autocorrelation_plot(data)
plt.show()

# Lag plot: random data
lag_plot(MealDatabase.total_bill)
plt.show()

# Lag plot: ordered data
data = pd.Series(np.arange(-12*np.pi, 300*np.pi, 10))
lag_plot(data)
plt.show()

# Parallel coordinates
parallel_coordinates(AttentionDatabase, "attention", color=('#556270', '#C7F464'))
plt.show()