There are many situations where one needs a bar-graph which displays some statistics for different categories under different conditions. In my case, I am interested in how well different programs predict the structures of RNA molecules. Thus the data can be partitioned into the categories (the RNA structures) and the conditions (the prediction programs):
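As a stand-in, here is a minimal sketch of how such data could be laid out. The program names, structure names, and all numbers are made-up placeholders, not real prediction results, and the variable names are my own choice for illustration.

```python
# Hypothetical placeholder data: one (condition, category, value) tuple per bar.
# None of these numbers are real prediction results.
data = [
    ("program_a", "rna_1", 9.9),
    ("program_a", "rna_2", 5.4),
    ("program_a", "rna_3", 7.1),
    ("program_b", "rna_1", 3.8),
    ("program_b", "rna_2", 6.5),
    ("program_b", "rna_3", 2.2),
]

# The distinct conditions (programs) and categories (structures).
conditions = sorted({cond for cond, _, _ in data})
categories = sorted({cat for _, cat, _ in data})
```

Every combination of condition and category contributes exactly one value, so a chart of this data has len(conditions) * len(categories) bars.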
The Plot
With matplotlib, we can create a bar chart, but we need to specify the location of each bar as a number (its x-coordinate). If there were only one condition and multiple categories, these positions could simply be the integers from zero up to one less than the number of categories. We would want to separate each bar by a certain amount (say space = 0.1 units). Thus we can define the width of each bar to be width = 1 - space and its left-most position as pos = j - (width / 2), where j is the x-coordinate where it should be centered.
If we have n conditions, then we want to place n bars in such a manner that they are centered around j. Note that I keep referring to j since that is where we will place the x-axis labels. So, with n bars, the width of each will be width = (1 - space) / n and the left-most position of the bar for condition i (counting from zero) will be pos = j - (1 - space) / 2 + i * width.
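The two formulas can be checked with a small helper. This is only a sketch; the function name bar_geometry is my own and not part of matplotlib.

```python
def bar_geometry(j, i, n, space=0.1):
    """Return (left_edge, width) of bar i (0-based) out of n bars
    in the group centered at x-coordinate j."""
    width = (1 - space) / n
    left = j - (1 - space) / 2 + i * width
    return left, width


# With two conditions, the group around j = 0 spans [-0.45, 0.45].
first_left, first_width = bar_geometry(j=0, i=0, n=2)
second_left, second_width = bar_geometry(j=0, i=1, n=2)
```

With n = 1 the expression reduces to the single-condition case, since pos = j - (1 - space) / 2 is the same as j - width / 2.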
To create the chart, we simply iterate over the conditions and place bars at their prescribed positions:
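A minimal sketch of that loop follows, assuming the values are stored per condition with one number per category. The names and numbers are made-up placeholders. Note that recent matplotlib versions center bars on x by default, so align="edge" is passed to make x the left-most position as in the formulas above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view the plot interactively
import matplotlib.pyplot as plt

# Hypothetical placeholder values: one number per category for each condition.
categories = ["rna_1", "rna_2", "rna_3"]
values = {"program_a": [9.9, 5.4, 7.1], "program_b": [3.8, 6.5, 2.2]}

space = 0.1
n = len(values)
width = (1 - space) / n

fig, ax = plt.subplots()
for i, (cond, vals) in enumerate(values.items()):
    # Left edge of this condition's bar within each category group j.
    lefts = [j - (1 - space) / 2 + i * width for j in range(len(categories))]
    ax.bar(lefts, vals, width=width, align="edge")
```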
This will yield a very bare bones plot like this:
So now the bars are in their appropriate places, but the plot is missing some of the essentials of a chart:
Axis ticks and labels
The ticks should correspond to the category names and should be centered under each group of bars. I like to turn them 90 degrees so that they don’t overlap, although in this case it’s not an issue.
Labels are required to show what we are actually representing.
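Both steps might look like the sketch below. The category names and the axis label texts are placeholders of my own; substitute whatever your data actually represents.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view the plot interactively
import matplotlib.pyplot as plt

# Hypothetical placeholder categories and a single series of made-up values.
categories = ["rna_1", "rna_2", "rna_3"]
fig, ax = plt.subplots()
ax.bar(range(len(categories)), [9.9, 5.4, 7.1], width=0.9)

# One tick per category, centered under its group (x = 0, 1, 2, ...),
# with the tick labels rotated by 90 degrees.
ax.set_xticks(list(range(len(categories))))
ax.set_xticklabels(categories, rotation=90)

# Axis labels (placeholder wording).
ax.set_xlabel("Structure")
ax.set_ylabel("Value")
```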
Colors and legend
The barebones plot does not distinguish between the different conditions. We need to color each bar and add a legend to inform the viewer which bar corresponds to which condition. The legend will be created by first adding a label to each bar command and then using some matplotlib magic to automatically create and place it within the plot.
The colors will be chosen using a colormap designed for categorical data (the Accent colormap, available as matplotlib.cm.Accent). Thus the original ax.bar function call will be changed to the following:
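One way to write the modified call is sketched here, with the same made-up placeholder data as before. I look the colormap up by name with plt.get_cmap("Accent") and sample it at i / n, which is an assumption about how the colors are spread, not necessarily the exact expression used for the plots in this post.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view the plot interactively
import matplotlib.pyplot as plt

# Hypothetical placeholder values: one number per category for each condition.
categories = ["rna_1", "rna_2", "rna_3"]
values = {"program_a": [9.9, 5.4, 7.1], "program_b": [3.8, 6.5, 2.2]}

space = 0.1
n = len(values)
width = (1 - space) / n
cmap = plt.get_cmap("Accent")

fig, ax = plt.subplots()
for i, (cond, vals) in enumerate(values.items()):
    lefts = [j - (1 - space) / 2 + i * width for j in range(len(categories))]
    # label feeds the legend; color is sampled from the categorical colormap.
    ax.bar(lefts, vals, width=width, align="edge",
           label=cond, color=cmap(i / n))
```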
And the legend will be created with the following two lines:
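A sketch of those two lines in context is below. The data is a made-up placeholder, and loc="upper left" is my own choice; leave it out to let matplotlib pick a location.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view the plot interactively
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Two labeled series of made-up placeholder values.
ax.bar([-0.45, 0.55], [9.9, 5.4], width=0.45, align="edge", label="program_a")
ax.bar([0.0, 1.0], [3.8, 6.5], width=0.45, align="edge", label="program_b")

# Collect the labeled artists, then let matplotlib build the legend from them.
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels, loc="upper left")
```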
This yields a respectable looking bar chart:
Bar Arrangement
There is one thing that bothers me. The locations of the bars are scattered at the whim of the initial data set. Since the primary purpose of making this plot was to compare different categories and conditions, I would like the locations of the bars and the categories to be ordered to reflect the data. More specifically, the positions of the categories on the x-axis should be ordered by the average value of each category over all conditions, and the positions of the bars within each category should be ordered by the average value of each condition over all categories.
To do this, I will first calculate the aggregate values for each category and condition and then sort them:
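A minimal sketch with NumPy, assuming the values sit in a 2-D array with one row per condition and one column per category. The numbers are made-up placeholders, and sorting in ascending order is my assumption; reverse the index arrays for descending order.

```python
import numpy as np

# Hypothetical placeholder data: rows are conditions, columns are categories.
conditions = ["program_a", "program_b"]
categories = ["rna_1", "rna_2", "rna_3"]
values = np.array([[9.9, 5.4, 7.1],
                   [3.8, 6.5, 2.2]])

# Aggregate: mean of each category over all conditions, and vice versa.
cat_means = values.mean(axis=0)
cond_means = values.mean(axis=1)

# Indices that sort the aggregates in ascending order.
cat_order = np.argsort(cat_means)
cond_order = np.argsort(cond_means)

sorted_categories = [categories[k] for k in cat_order]
sorted_conditions = [conditions[k] for k in cond_order]
```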
Then I will sort the original data set so that the data is ordered in accordance with the sorted categories:
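Continuing with the same assumed layout (a 2-D array of made-up placeholder values, rows for conditions and columns for categories), the reordering can be sketched with np.ix_, which selects rows and columns in the given orders at once.

```python
import numpy as np

# Hypothetical placeholder data: rows are conditions, columns are categories.
conditions = ["program_a", "program_b"]
categories = ["rna_1", "rna_2", "rna_3"]
values = np.array([[9.9, 5.4, 7.1],
                   [3.8, 6.5, 2.2]])

# Ascending order of the per-condition and per-category means (my assumption).
cond_order = np.argsort(values.mean(axis=1))
cat_order = np.argsort(values.mean(axis=0))

# Reorder rows and columns together so the data matches the sorted names.
sorted_values = values[np.ix_(cond_order, cat_order)]
sorted_conditions = [conditions[k] for k in cond_order]
sorted_categories = [categories[k] for k in cat_order]
```

Row i of sorted_values now belongs to sorted_conditions[i] and column j to sorted_categories[j], so the plotting loop can run over them unchanged.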
With this done, I continue creating the plot as before. For convenience, I’ve pasted the resulting plot: