An important concept in statistics and data science is distribution. A distribution describes how often each possible outcome occurs, or how likely it is to occur. If you flip a coin 100 times, how many flips come up heads and how many come up tails? Frequency distributions like this are usually presented as histograms or curves.
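The coin-flip question can be sketched in a few lines of Python. This is only an illustration: the flips are simulated, the seed is an arbitrary choice that makes the run repeatable, and the text bars stand in for a real histogram.

```python
import random
from collections import Counter

# Simulate 100 fair coin flips; the seed value is arbitrary and only
# makes the run repeatable.
rng = random.Random(42)
flips = [rng.choice(["heads", "tails"]) for _ in range(100)]

# A frequency distribution is the count of each outcome.
counts = Counter(flips)

# Crude text histogram: one '#' per observation.
for outcome in ("heads", "tails"):
    print(f"{outcome:<5} {counts[outcome]:>3} {'#' * counts[outcome]}")
```

The two counts always add up to 100, but the split between heads and tails changes if you change the seed.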
Below is the distribution of students’ heights in a swimming class. The x-axis shows the height categories and the y-axis shows the number of students in each category.
That’s a frequency distribution. A related idea is dispersion, which describes how widely the values of a variable are spread around its central tendency.
A classic representation of dispersion is the boxplot.
The boxplot above represents the distribution of the number of air passengers on Saturdays over a number of years. This single plot reveals a lot of information: the median number of passengers on Saturdays, the spread between the quartiles, the minimum and maximum, the outliers and more!
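To see where the parts of a boxplot come from, here is a minimal sketch that computes them by hand. The passenger counts are made up for illustration (they are not the data behind the plot), and the 1.5 × IQR rule for outliers is a common convention that is assumed here rather than taken from the original chart.

```python
import statistics

# Hypothetical Saturday passenger counts (made up for illustration).
saturday_passengers = [112, 118, 121, 125, 129, 132, 135, 140, 148, 210]

# Quartiles: the box spans q1..q3 and the line inside it is the median.
q1, median, q3 = statistics.quantiles(
    saturday_passengers, n=4, method="inclusive"
)
mean = statistics.mean(saturday_passengers)

# Assumed convention: points more than 1.5 * IQR beyond the box
# are treated as outliers.
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in saturday_passengers if x < low_fence or x > high_fence]
inside = [x for x in saturday_passengers if low_fence <= x <= high_fence]

print(f"whiskers: {min(inside)} to {max(inside)}")
print(f"box: {q1} to {q3}, median {median}, mean {mean}")
print(f"outliers: {outliers}")
```

With these numbers the single very busy Saturday pulls the mean above the median, which is one reason a boxplot is built around the median and quartiles.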
In their early years, trees grow taller as they get older. That’s a relationship between two variables: height and age.
height = f(age)
In another example, the price of a house depends on the number of bedrooms, the number of bathrooms, the location, the square footage and so on. This is a relationship between one dependent variable and many explanatory variables.
price = f(beds, baths, location, area)
If you look at a dataset as raw numbers alone, these relationships are hard to spot. A good visualization can reveal them without any complex statistical analysis.
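As a numeric companion to such a plot, the notation height = f(age) can be made concrete with a straight-line fit. This is only a sketch: the measurements are hypothetical, and a straight line is an assumption that is plausible only for the early years of growth mentioned above.

```python
import math

# Hypothetical measurements of one young tree (made up for illustration).
ages_years = [1, 2, 3, 4, 5]
heights_m = [0.5, 1.1, 1.8, 2.4, 3.1]

n = len(ages_years)
mean_age = sum(ages_years) / n
mean_height = sum(heights_m) / n

# Sums of squared deviations and cross-deviations from the means.
s_xy = sum(
    (a - mean_age) * (h - mean_height) for a, h in zip(ages_years, heights_m)
)
s_xx = sum((a - mean_age) ** 2 for a in ages_years)
s_yy = sum((h - mean_height) ** 2 for h in heights_m)

# Least-squares line height = intercept + slope * age, plus Pearson's r.
slope = s_xy / s_xx
intercept = mean_height - slope * mean_age
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"height ~ {intercept:.2f} + {slope:.2f} * age (r = {r:.3f})")
```

A correlation close to 1 says the same thing a scatter plot of these points would show at a glance: the points lie almost on a rising straight line.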
The third cornerstone of data visualization is Comparison. This kind of visual compares multiple variables in a dataset, or multiple categories within a single variable.
Let’s check out the following two visuals:
The one on the left compares a variable (salary) between two groups of observations (scientists vs lawyers) on a bar chart. The right panel is also a comparison chart; in this case it compares a variable (GDP) between two groups (UK and Canada), but along a time dimension.
Have you heard of stacked bar charts? Even if you haven’t, I’m sure you know what a pie chart is.
The purpose of these charts is to show the composition of one or more variables in absolute numbers and in normalized forms (e.g. percentage).
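The difference between absolute numbers and normalized forms is simple arithmetic, sketched below. The category names and counts are hypothetical and exist only to make the example runnable.

```python
# Hypothetical counts per category (made up for illustration).
fruit_sold = {"apples": 30, "bananas": 50, "cherries": 20}

total = sum(fruit_sold.values())

# Normalized form: each category's share of the whole, in percent.
shares = {name: 100 * count / total for name, count in fruit_sold.items()}

# Crude text bar per category: one '#' per percentage point.
for name, count in fruit_sold.items():
    bar = "#" * round(shares[name])
    print(f"{name:<9} {count:>3} {shares[name]:5.1f}% {bar}")
```

A pie chart or a 100% stacked bar draws the `shares` values, while a regular stacked bar draws the raw counts.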
Composition charts are old-school visualization techniques that have limited use cases nowadays (do you really need a pie chart to show a composition of 10% yellow and 15% red?). Nevertheless, they can sometimes present information in a visually pleasing, familiar, vintage fashion.