Charts to compare different objects

Hi everyone, great to see you here!

This is my 5th blog on a series of data visualization with charts for specific purposes.

Sometimes, when doing data analysis, we want to have a comparison between 2 (or among some) objects on more than 1 criterion. For example, for the 4 seasons in a year, how are the temperature and humidity compared? In these cases, draw a plot should be a good choice.

For this purpose, we have:

Bar chart
Radar chart
Lollipop plot
Parallel Coordinate plot

First, let’s, as usual, import our beautiful libraries.

import numpy as np
import pandas as pd
import scipy as sp
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns

And we can (optionally) set some default parameters for our plots. I usually define the figure size and style as below.

# set default figure-size
matplotlib.rcParams['figure.figsize'] = (12, 8)
# set default style
plt.style.use('seaborn-darkgrid')

In case you want to select another style, you can see a list by running the below command.

print(plt.style.available)

To begin, we will create a data frame.

# create dataframe
data = pd.DataFrame({\
    'var1' : ['A', 'B', 'C', 'D', 'E'], \
    'var2' : [8, 5, 7, 4, 9], \
    'var3' : [9, 14, 12, 17, 13], \
    'var4' : ['X', 'Y', 'X', 'Y', 'X'], \
                    })

# show the data
print(data)

  var1  var2  var3 var4
0    A     8     9    X
1    B     5    14    Y
2    C     7    12    X
3    D     4    17    Y
4    E     9    13    X

Bar chart

For the Bar chart, we use the classic matplotlib library.

In this example, we draw the values of column var2 for each object of var1. For illustration purposes, let’s suppose var1 stores the names of people, while var2 is their corresponding height. Hence, the following Bar chart shows the height of each person in our dataset.

# create a figure
fig, ax = plt.subplots()

# plot a bar-chart
ax.bar(range(len(data)), # x-axis positions
       data['var2'], # y-axis values
       color='lightgreen', # color of the bars
       edgecolor='green', # color of the bar edges
       linewidth=2, # thickness of the edges
       tick_label = data['var1'] # x-tickers
      )

# write the value above each bar
for i, value in enumerate(data['var2']):
    ax.text(i, # x-position of the text, which is bar-index
            value + 0.1, # y-position, just above the bar
            value, # value of the bar
            horizontalalignment='center', # align text
           )

# set title for the plot
ax.set_title('A sample bar plot', # title
             size=20, # title text-size
            )
ax.set_xlabel('Variable 1') # set label for x-axis
ax.set_ylabel('Variable 2') # set label for y-axis
    
# show the plot
plt.show()

It is quite simple, isn’t it?

Most of the code above is for aesthetic purposes, while the main code for drawing the bars occupies just several lines.

For more complex queries, like comparing objects on 2 criteria, Bar charts can also get the work done. However, it is worth noting that there is some manual work we have to do in order to make a grouped-bar-chart.

In particular, we have to set: a width for each bar, the space size between 2 bars, the positions where the bars are placed.

# create a figure
fig, ax = plt.subplots()

bar_width = 0.4 # width of each bar
space = 0.04 # space between 2 bars
position1 = [i for i in range(len(data))]
position2 = [pos + bar_width + space for pos in position1]

# plot the first bar-chart
ax.bar(position1, # x-axis positions
       data['var2'], # y-axis values
       color='lightgreen', # color of the bars
       tick_label=data['var1'], # x-tickers
       width=bar_width, # set bars' width
       label='Var2', # label of green bars
      )

# plot the second bar-chart
ax.bar(position2, # x-axis positions
       data['var3'], # y-axis values
       color='lightblue', # color of the bars
       width=bar_width, # set bars' width
       label='Var3' # label of blue bars
      )

# set title for the plot
ax.set_title('A sample bar plot', # title
             size=20, # title text-size
            )

# show legend
plt.legend()

# show the plot
plt.show()

That is all with the Bar charts from Matplotlib.

Seaborn, on the other hand, as being an enhanced version of Matplotlib, gives us an additional feature for the Bar chart, which is aggregating. A bar can represent not only a value but a list of values in terms of mean and standard deviation.

In the subsequent example, all the samples (rows) which have var4 equals X are grouped, taken the mean and std of var2 and drawn as the blue bar. The orange bar is made by the same process with var4 equals Y. The height of the bars depicts the mean, while the black line on the top of each bar indicates the standard deviation.

# plot an aggregating-bar-graph
ax = sns.barplot(data=data, x='var4', y='var2')

# show the plot
plt.show()

Radar chart

The radar chart, or Spider chart, is one of the fancy visualizations.

In the Radar chart, we start at the center of a polar axis. Each direction represents a criterion. The farther we go on a direction, the larger the value of that criterion the object has.

Unfortunately, our close friends Matplotlib and Seaborn do not support us in making Radar charts. Hence, we have to ask for it from another buddy, called Plotly.

Plotly is a versatile library for plotting, especially in the case of interactive plots. As its plots are very beautiful and interesting, I would highly encourage you guys to give a try!

from plotly.express import line_polar

To make a demo of the Radar chart would require a larger dataset, so let us re-create our data by adding some more values to it.

# create dataframe
data = pd.DataFrame({\
    'var1' : ['A', 'B', 'C', 'D', 'B', 'C', 'D', 'A'], \
    'var2' : [10, 6, 3, 5, 5, 7, 8, 6], \
    'var3' : [9, 14, 12, 17, 13, 7, 10, 9], \
    'var4' : ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'], \
                    })

# sort our dataframe by var1
# to ensure the polygon we are going to draw is convex
data = data.sort_values(by=['var1'])

Now is the time to draw!

The syntax for calling the plot function is quite simple. With the comments attached to each line of code, the below snippet is supposedly self-explanatory.

fig = line_polar(data[data['var4'] == 'X'], # input dataframe
                 theta='var1', # the criteria
                 r='var2', # the magnitude of values
                 line_close=True, # the line is closed
                 width=400, # width of the figure
                 height=400 #height of the figure
                )
fig.update_traces(fill='toself') # fill the closed-line shape

# show the figure
fig.show()

Notice that we have a new parameter – color – here, which acts the same as hue on Seaborn.

from plotly.express import line_polar
fig = line_polar(data,  # input dataframe
                 theta='var1', # the criteria
                 r='var2', # the magnitude of values
                 color='var4', # rows with the same 
                 # var4's values are draw together as a closed-line
                 line_close=True, # the line is closed
                 width=400, # width of the figure
                 height=400 #height of the figure
                )
fig.update_traces(fill='toself') # fill the closed-line shape

# show the figure
fig.show()

Lollipop plot

The third chart we are gonna see on this blog is the Lollipop plot.

This is a variant of the Scatter plot. The scatter-points of the same color belong to the same object and represent the magnitude of that object for a criterion. The difference in each criterion of 2 objects is depicted by a grey line. The longer the grey line, the larger the difference.

# number of criteria to compare
n_criteria = data['var1'].nunique()

# get separate dataframe for each object
X_data = data[data['var4'] == 'X']
Y_data = data[data['var4'] == 'Y']

# make a new figure
fig, ax = plt.subplots()

# draw scatter points of each object
ax.scatter(X_data['var2'], # x-axis, magnitude of values
           range(1, n_criteria+1), # y-axis, criteria
           color='cyan', # color of scatter-points
           s=50, # sizeo of scatter-points
           label='X' # this object is X
          )
ax.scatter(Y_data['var2'], # x-axis, magnitude of values
           range(1, n_criteria+1), # y-axis, criteria
           color='lightgreen', # color of scatter-points
           s=50, # sizeo of scatter-points
           label='Y' # this object is X
          )

# draw horizontal-lines connect the points
ax.hlines(y=range(1, n_criteria+1), # height of the line
          xmin=X_data['var2'], # endpoint of the lines  
          xmax=Y_data['var2'], # endpoint of the lines  
          color='grey' # color of the lines
         )

# set tickers for y-axis
ax.set_yticks(range(1, n_criteria+1))
ax.set_yticklabels(data['var1'].unique())

# set limit for x-axis
ax.set_xlim(0, 12)

# set title and x-label, y-label for the figure
ax.set_title('Compare object X versus object Y', size=20)
ax.set_xlabel('var2', size=15)
ax.set_ylabel('var1', size=15)

# show legend
plt.legend()

# show plot
plt.show()

Parallel Coordinate plot

Notice that the above visualizations can hardly be used if the value ranges of the criteria are different, which is their big drawback.

To solve this co-ordinate scaling problem, a candidate is Parallel Coordinate Plot.

In this plot, each criterion (or co-ordinate) can have a different scale.

Each object is represented by a line, starting from the first and end at the last co-ordinate bar.

# import plotly and parallel_coordinates
import plotly.express as px
from plotly.express import parallel_coordinates

# load iris data
iris = px.data.iris()

# draw the plot
fig = parallel_coordinates(
    data_frame=iris, # data
    dimensions=['petal_width', 'petal_length', \
                'sepal_width','species_id'], # criteria
    color="species_id", # different species has different color
    color_continuous_scale=px.colors.diverging.Temps, # color pallete
    title="This is a Parallel Coordinate Plot", # title of the plot
    width=400, # width of the figure
    height=600 #height of the figure
                          )

# show the figure
fig.show()

On a side note, Plotly, compared to Matplotlib and Seaborn:

has some big advantages from its interactive features and gorgeous visualization.
has some additional charts.
is harder for users to customize.
consume more CPU resources to render. (But this is not a big deal with modern computers)

Conclusion

It is coming to the end!

In this blog, we exhibited 4 types of charts for the aim of comparing different objects on different criteria.

The bar chart is the most simple one, which is very popular in textbooks. It is easy to understand and very straightforward.

The Radar chart is attractive and fancy. However, we should use it with caution because it can mislead audiences about the main point. The audiences may have the wrong perception looking at its shape, its area and symmetricity, while these characteristics depend mostly on the order of the criteria, but not the values themselves.

The Lollipop plot is simple and elegant. A very strong point of the Lollipop is that it can work with a large number of criteria, but a weak point is that it would not perform well with many objects. In most cases, the Lollipop plot is good for 2 or 3 objects.

The Parallel Coordinate plot is quite different. It is able to support a very large number of objects, and a moderate number of criteria (from our experience, 5 or 7 criteria is still good). It has a scale for each criterion, which is a plus point over the preceding 3 charts. A downside is that it is a bit more complex, taking more time for the audience to grasp the sense of the data.

References:

Matplotlib’s Bar chart: link
Seaborn’s Bar chart: link
Plotly’s Radar chart: link
Matplotlib’s scatter plot: link
Plotly’s parallel coordinate plot: link

Tung M Phung's Blog

Charts to compare different objects

Leave a ReplyCancel reply