Charts to show the distribution

A beautiful sight

Hi everyone, this is the 2nd post in my series on data visualization, where each post covers charts for a specific purpose.

This time, we are going to examine the distribution of the data. The charts suited to this goal are:

  • Histogram
  • Box-plot
  • Violin plot

First, as usual, let’s import our beautiful libraries.

import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats  # make sure sp.stats is loaded (older SciPy does not import it automatically)
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns

And we can (optionally) set some default parameters for our plots. I usually define the figure size and style as below.

# set default figure-size
matplotlib.rcParams['figure.figsize'] = (12, 8)
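# set default style (matplotlib 3.6+ renamed it to 'seaborn-v0_8-darkgrid')
style = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid'
plt.style.use(style)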

In case you want another style, you can list the available ones by running the command below.

print(plt.style.available)

Histogram

The histogram is one of the most frequently used charts.

It puts our data into bins and then draws a bar graph of the count in each bin. It can also draw the counts in the form of density. That is, each count is divided by the total number of values and by the bin width before plotting, so that the areas of the bars sum to 1.
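
To make the difference between counts and density concrete, here is a minimal sketch that does the binning by hand with NumPy. The names (sample, counts, density), the seed and the choice of 10 bins are only for illustration; they are not what seaborn uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
sample = rng.normal(0, 1, 1000)

# Raw counts: how many values fall into each of the 10 bins
counts, edges = np.histogram(sample, bins=10)

# Density: each count divided by (total number of values * bin width)
density, _ = np.histogram(sample, bins=10, density=True)
widths = np.diff(edges)

print(counts.sum())              # total number of values
print((density * widths).sum())  # total area of the density bars
```

The counts add up to the number of values, while the density bars add up to an area of 1. That is why a density histogram can be compared directly with a probability density curve.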

# Data to plot
values = np.random.normal(0, 1, 1000)

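# Draw a density histogram with a smooth (KDE) curve on top
# (histplot replaces distplot, which is deprecated since seaborn 0.11)
fig, ax = plt.subplots()
sns.histplot(values, kde=True, stat='density', ax=ax)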

# Draw a normal distribution curve
mu, sigma = sp.stats.norm.fit(values)
x_line = np.linspace(min(values), max(values), 1000)
y_line = sp.stats.norm.pdf(x_line, mu, sigma)

ax.plot(x_line, y_line, color='lightblue')
plt.show()
histogram

In the figure above, aside from the histogram, we also draw 2 lines:

  • The bold blue line is a kernel density estimate (KDE), that is, a smoothed version of the histogram.
  • The light blue line is a normal distribution curve whose mean and standard deviation are estimated from our data.

By checking how closely the 2 lines coincide, we get a rough visual sense of whether our data follows a normal distribution. In this example the 2 lines match closely, suggesting that our data is approximately normally distributed.
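
If you prefer a number to eyeballing, the same comparison can be sketched without plotting. This is only a rough illustration: scipy.stats.gaussian_kde stands in for the smooth curve seaborn draws (its bandwidth settings may differ), and the names grid and gap are made up for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
sample = rng.normal(0, 1, 1000)

# Smooth estimate of the density (stand-in for the bold line)
kde = stats.gaussian_kde(sample)

# Normal curve fitted to the same data (the light line)
mu, sigma = stats.norm.fit(sample)

# Evaluate both curves on a common grid and measure how far apart they are
grid = np.linspace(sample.min(), sample.max(), 200)
gap = np.abs(kde(grid) - stats.norm.pdf(grid, mu, sigma))

print(f"largest gap between the two curves: {gap.max():.3f}")
```

A small gap relative to the peak height of the curves (about 0.4 for a standard normal) tells the same story as the picture. It is still a visual-style check, not a formal normality test.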

Box-plot

The box-plot is quite a versatile plot. We often use box-plots to inspect the distribution of the data and to detect outliers.

Seaborn does a really good job of improving on the box-plot. Its version not only looks better but also supports more features than the original one from matplotlib.

In the code below, we create a data frame that has 3 numerical columns and 2 categorical columns. Then, we draw a box-plot for all the numerical columns.

# Data to plot
data = pd.DataFrame({
    'Var1': np.random.normal(5, 5, 1000),
    'Var2': np.random.normal(7, 3, 1000),
    'Var3': np.random.normal(9, 7, 1000),
    'Var4': np.random.choice(['A', 'B', 'C', 'D'], size=1000, replace=True),
    'Var5': np.random.choice(['X', 'Y'], size=1000, replace=True),
})

# Draw box-plot
fig, ax = plt.subplots()
ax = sns.boxplot(data=data, whis=1.5)
ax.set_title('This is a box-plot', fontsize=20)

plt.show()
box-plot

We have 3 boxes: the blue, orange and green ones. Each of them represents a variable.

For each box:

  • The lower and upper horizontal edges of the box represent the first quartile (Q1) and the third quartile (Q3) of the data, while the line in between shows the second quartile (Q2, the median).
  • The height of the box is called the interquartile range (IQR), which equals Q3 - Q1.
  • The lowest and highest horizontal bars, which sit outside the box, are the whiskers. They reach to the most extreme data points that still lie within the valid range, which runs from (Q1 - 1.5 IQR) to (Q3 + 1.5 IQR). Values that fall outside this range are considered outliers and are plotted as diamond symbols. If you want a multiplier other than 1.5, set it with the ‘whis’ parameter.
explanation of box-plot
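
The quantities in the list above can be computed by hand, which is a handy way to check what the plot shows. Below is a minimal sketch with NumPy. The variable names (lower_fence, upper_fence and so on) are my own, and the percentile method may differ slightly from the one the plotting library uses.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
sample = rng.normal(5, 5, 1000)

# Quartiles and interquartile range
q1, q2, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Valid range, using the same multiplier as whis=1.5
whis = 1.5
lower_fence = q1 - whis * iqr
upper_fence = q3 + whis * iqr

# Points inside the valid range, and the outliers beyond it
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

# The whiskers end at the most extreme points that are still inside the range
lower_whisker, upper_whisker = inside.min(), inside.max()

print(f"Q1={q1:.2f}, median={q2:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"whiskers: {lower_whisker:.2f} to {upper_whisker:.2f}")
print(f"number of outliers: {len(outliers)}")
```

Changing whis in this sketch widens or narrows the valid range, which is what happens to the whiskers when you pass a different ‘whis’ to the plot.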

Now let’s move on to a slightly more complex box-plot:

# Draw box-plot
fig, ax = plt.subplots()
ax = sns.boxplot(data=data, x='Var4',
                 y='Var1', hue='Var5',
                 palette=sns.color_palette('muted'))
ax.set_title('This is another box-plot', fontsize=20)

plt.show()
multi-box-plot

What do we have here?

Now we don’t plot all the numerical columns anymore. We are only plotting values from the column Var1, but those values are split by Var4 and Var5.

The x-axis represents the values of Var4, while the colors (blue and orange) represent the values of Var5. This multi-box-plot helps us not only see the distribution of the values but also compare the distributions across the different categories.
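
The numbers behind such a grouped plot can be sketched with a pandas groupby. The frame below is rebuilt with a fixed seed so the example runs on its own; the names frame and summary are only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
frame = pd.DataFrame({
    'Var1': rng.normal(5, 5, 1000),
    'Var4': rng.choice(['A', 'B', 'C', 'D'], size=1000),
    'Var5': rng.choice(['X', 'Y'], size=1000),
})

# One row per (Var4, Var5) pair, i.e. one row per box in the plot
summary = (
    frame.groupby(['Var4', 'Var5'])['Var1']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ['Q1', 'median', 'Q3']

print(summary)
```

Each row of the summary corresponds to one box: 4 values of Var4 times 2 values of Var5 gives 8 boxes.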

Violin-plot

The violin-plot is very similar to the box-plot. It carries all the information a box-plot has (because it actually contains a box-plot inside), along with some extra pros:

  • It is more beautiful.
  • It even shows the density of values.

So does that mean the violin-plot can take over from the box-plot?
Not yet, as the violin also has some cons:

  • Because it is more beautiful, it may distract users from the main information.
  • Because the violin-plot uses kernel density estimation to “guess” the density, the violin’s 2 ends may extend beyond the true value range. For example, when we have a violin-plot showing a ton of percentage values (those values are in the range from 0 to 1), the 2 ends of our violin may go a little higher than 1 and lower than 0, which does not make sense. This drawback can be solved by setting the parameter “cut=0”, but this will make the plot a bit uglier.
  • Also because the density is estimated, if we have only a few values the estimate may be unreliable, and the density we see in our violin is not the real density.

# Draw a violin-plot
fig, ax = plt.subplots()
ax = sns.violinplot(data=data)
ax.set_title('This is a violin-plot', fontsize=20)

plt.show()
violin-plot

Notes:

  • The commands to plot a violin-plot are very similar to those for a box-plot.
  • Unfortunately, they are not exactly the same. Some parameters cannot be set in a violin-plot, for example “whis”.
  • The width of a violin represents the density. There is no absolute unit for the density here, so we can only compare widths relative to each other to learn something like: oh, around this value the density is higher than around that value.
  • Look at the center of each violin: there is a box-plot there.
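
To see why the ends of a violin can leave the true value range, here is a minimal sketch that estimates the density by hand. It uses scipy.stats.gaussian_kde as a stand-in for the estimate behind the violin (seaborn's own settings may differ), and the names percentages, outside and clipped_grid are made up for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
percentages = rng.uniform(0, 1, 200)  # every value lies between 0 and 1

kde = stats.gaussian_kde(percentages)

# The estimated density is still positive outside [0, 1],
# even though no data point can ever be there
outside = kde(np.array([-0.05, 1.05]))
print(outside)

# Evaluating only between the smallest and largest observation
# mimics what cut=0 does: the curve stops at the extreme data points
clipped_grid = np.linspace(percentages.min(), percentages.max(), 200)
clipped_density = kde(clipped_grid)
print(clipped_grid.min(), clipped_grid.max())
```

The values in clipped_density are what the width of the violin would follow. They have no natural unit on the plot, so only their relative sizes matter.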

And here is a multi-violin-plot, which is equivalent to the multi-box-plot above.

# Draw a violin-plot
fig, ax = plt.subplots()
sns.violinplot(data=data, x='Var4',
               y='Var1', hue='Var5',
               cut=0,
               palette=sns.color_palette('muted'))

ax.set_title('This is also a violin-plot', fontsize=20)

plt.show()
multi-violin-plot

