Choosing the Right Graphs for Your Feature Variables
When analyzing your data, say to decide which type of regression to use, it is important to first figure out what kind of data you actually have. In fact, we should all explore our data before proceeding with any form of analysis, as it could save us a great deal of work later on. For example, we could accidentally choose the wrong regression model because of an unforeseen interaction between variable one and variable five, which a closer look at the data beforehand could have caught. Data analysis depends heavily on the types of your feature variables, how they are distributed, and how they are related to each other.
Before proceeding with any data analysis, we must first make a distinction between quantitative and qualitative/categorical variables.
Quantitative variables are variables that can be measured, and they are expressed numerically. Categorical variables, on the other hand, are descriptive and typically take on values such as names or labels. Qualitative data can be grouped by shared characteristics, which is why these variables are also called categorical.
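As a rough illustration of this distinction, here is a minimal sketch that labels a column as quantitative when every non-missing value is a number and as categorical otherwise. The function name `variable_kind` and the sample columns are made up for this example. The check is only a heuristic: categories coded as numbers (for example 1, 2, 3 for group labels) would be mislabeled, so the result should still be checked against what each variable actually means.

```python
from numbers import Number


def variable_kind(values):
    """Label a column 'quantitative' if all non-missing values are numeric."""
    present = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if present and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in present
    ):
        return "quantitative"
    return "categorical"


# Hypothetical columns, invented for illustration only.
columns = {
    "height_cm": [170.2, 165.0, None, 180.5],
    "species": ["cat", "dog", "dog", "cat"],
    "zip_code": ["90210", "10001", "60614", "73301"],
}

kinds = {name: variable_kind(vals) for name, vals in columns.items()}
print(kinds)
```

Note that `zip_code` comes out as categorical here only because it is stored as text; it is a label, not a measurement, even though it looks numeric.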
All the graphs mentioned below can easily be plotted in Python with the Seaborn library (you can do this with Matplotlib as well if you wish), or in R with ggplot2.
We start by loading our data into Python as a dataframe. Here, I am loading it from a CSV file in the same directory.
import pandas as pd
import seaborn as sns

# sep must match the file's delimiter; a standard CSV uses ","
data = pd.read_csv("filename.csv", sep=",", header="infer")
Or load it into R as a dataframe.
library(tidyverse)

data <- read_csv("filename.csv")
If you wish to visualize a single categorical variable, use a bar chart, where the x-axis holds the variable's categories and the y-axis holds the count of each.
sns.catplot(x = "categorical var", kind = "count", data = data)
ggplot(data, aes(x = `categorical var`)) + geom_bar()
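To make the count axis concrete, the sketch below tallies a small made-up list of category values with the standard library. These tallies are the bar heights a count plot would draw; the values themselves are invented for illustration.

```python
from collections import Counter

# Hypothetical values of a single categorical variable.
values = ["red", "blue", "red", "green", "blue", "red"]

# Each category's count is the height of its bar.
counts = Counter(values)

# Print a crude text version of the bar chart, tallest bar first.
for category, n in counts.most_common():
    print(f"{category:<6} {'#' * n} ({n})")
```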
Grouped Bar Charts
If we have two categorical variables, we use a grouped bar chart. The bars are grouped by the second categorical variable, usually the one that has fewer categories.
sns.catplot(x = "categorical var1", hue = "categorical var2", kind = "count", data = data)
ggplot(data, aes(x = `categorical var1`, fill = `categorical var2`)) + geom_bar(position = "dodge")
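The rule of thumb of grouping by the variable with fewer categories can be sketched as a small helper. The function name `split_x_and_hue`, the list-of-dicts row layout, and the sample data are all assumptions made for this example; with a pandas dataframe you would compare the number of unique values per column instead.

```python
def split_x_and_hue(rows, var_a, var_b):
    """Return (x_var, hue_var); the variable with fewer categories is the hue."""
    distinct_a = {row[var_a] for row in rows}
    distinct_b = {row[var_b] for row in rows}
    if len(distinct_a) >= len(distinct_b):
        return var_a, var_b
    return var_b, var_a


# Hypothetical rows with two categorical variables.
rows = [
    {"city": "Oslo", "smoker": "yes"},
    {"city": "Lima", "smoker": "no"},
    {"city": "Pune", "smoker": "no"},
    {"city": "Oslo", "smoker": "no"},
]

x_var, hue_var = split_x_and_hue(rows, "city", "smoker")
print(x_var, hue_var)
```

The two returned names can then be passed as the `x` and `hue` arguments of the plotting call.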
Histograms are great for visualizing a quantitative variable. Here, we want to choose a number of bins that represents the data well. You can pick it from past experience, by experimenting with different bin counts, or with an objective bin-selection formula such as Sturges' rule.
sns.histplot(data["quantitative var"], bins = 10)
ggplot(data, aes(x = `quantitative var`)) + geom_histogram(bins = 10)
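For reference, Sturges' rule sets the number of bins to ceil(log2(n)) + 1, where n is the number of observations. Here is a minimal sketch of it; the function name `sturges_bins` is made up for this example.

```python
import math


def sturges_bins(n_observations):
    """Number of histogram bins suggested by Sturges' rule."""
    if n_observations < 1:
        raise ValueError("need at least one observation")
    return math.ceil(math.log2(n_observations)) + 1


# Suggested bin counts for a few sample sizes.
for n in (30, 100, 1000):
    print(n, sturges_bins(n))
```

The result can be passed as the `bins` argument in place of the hard-coded 10, for example `bins = sturges_bins(len(data))`. The rule tends to suggest too few bins for large or strongly skewed datasets, so it is best treated as a starting point rather than a final answer.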
When we have one quantitative and one qualitative variable, we will use a side-by-side boxplot to best showcase the data.
sns.boxplot(x = "categorical var", y = "quantitative var", data = data)
ggplot(data, aes(x = `categorical var`, y = `quantitative var`)) + geom_boxplot()
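Each box in a side-by-side boxplot is built from the quartiles of one group. The sketch below computes the minimum, quartiles, and maximum per group with the standard library, using invented data and a made-up helper name `five_number_summary`. Plotting libraries may compute quartiles slightly differently and usually cap the whiskers and mark outliers, so treat these numbers as an approximation of what gets drawn.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" assumes the data include their own extremes and
    # interpolates linearly between observed values.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Hypothetical measurements, keyed by the level of a categorical variable.
groups = {
    "control": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "treated": [2, 4, 4, 5, 7],
}

summaries = {name: five_number_summary(vals) for name, vals in groups.items()}
for name, summary in summaries.items():
    print(name, summary)
```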
Grouped boxplots are used when we have two categorical variables and a single quantitative one. Group by the categorical variable that has fewer groups.
sns.boxplot(x = "categorical var1", y = "quantitative var", hue = "categorical var2", data = data)
ggplot(data, aes(x = `categorical var1`, y = `quantitative var`, fill = `categorical var2`)) + geom_boxplot()
Scatterplots are used to visualize one quantitative variable against another. They are commonly used to evaluate the type of relationship between a quantitative feature (explanatory) variable and a quantitative response variable, where by convention the y-axis holds the response variable.
sns.scatterplot(x = "explanatory variable", y = "response variable", data = data)
ggplot(data, aes(x = `explanatory variable`, y = `response variable`)) + geom_point()
Scatterplot by Group
If we are trying to visualize two quantitative variables and one categorical one, we will use a scatterplot with its points grouped by the categorical variable.
sns.scatterplot(x = "explanatory variable", y = "response variable", hue = "categorical var", data = data)
ggplot(data, aes(x = `explanatory variable`, y = `response variable`, color = `categorical var`)) + geom_point()
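The choices covered in this guide can be summarized as a lookup from the number of quantitative and categorical variables to a suggested graph. This is only a sketch of the rules above; the function name `suggest_plot` and the fallback message are made up for this example.

```python
def suggest_plot(n_quantitative, n_categorical):
    """Suggest a graph for the given mix of variable types."""
    # Keys are (number of quantitative, number of categorical) variables.
    suggestions = {
        (0, 1): "bar chart",
        (0, 2): "grouped bar chart",
        (1, 0): "histogram",
        (1, 1): "side-by-side boxplot",
        (1, 2): "grouped boxplot",
        (2, 0): "scatterplot",
        (2, 1): "scatterplot by group",
    }
    return suggestions.get(
        (n_quantitative, n_categorical), "not covered in this guide"
    )


print(suggest_plot(1, 1))
print(suggest_plot(2, 1))
```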